Search Results: "pik"

27 July 2021

Dirk Eddelbuettel: RcppFarmHash 0.0.1: New CRAN Package

A new package RcppFarmHash is now on CRAN in an inaugural version 0.0.1. RcppFarmHash wraps the Google FarmHash family of hash functions (written by Geoff Pike and contributors) that are used for example by Google BigQuery for the FARM_FINGERPRINT. The package was prepared and uploaded yesterday afternoon, and to my surprise already on CRAN this (early) morning when I got up. So here is another #ThankYouCRAN for very smoothing operations. The very brief NEWS entry follows:

Changes in version 0.0.1 (2021-07-25)

Initial version and CRAN upload

If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

22 July 2021

Charles Plessy: Search in Debian's sources

Via my work on the media-types package, I wanted to know which packages were using the media type application/x-xcf, which apparently is not correct (#991158). The https://codesearch.debian.net site gives the answer. (Thanks!) Moreover, one can create a user key, for command-line remote access; here is an example below (the file dcs-apikeyHeader-plessy.txt contains x-dcs-apikey: followed by my access key).

curl -X GET "https://codesearch.debian.net/api/v1/searchperpackage?query=application/x-xcf&match_mode=literal" -H @dcs-apikeyHeader-plessy.txt > result.json

The result is serialised in JSON. Here is how I transformed it to make a list of email addresses that I could easily paste in mutt.

cat result.json  
  jq --raw-output '.[]."package"'  
  dd-list --stdin  
  sed -e '/^ /d' -e '/^$/'d -e 's/$/,/' -e 's/^/  /'

18 July 2021

Jamie McClelland: Google and Bitly

It seems I m the only person on the Internet who didn t know sending email to Google with bit.ly links will tank your deliverability. To my credit, I ve been answering deliverability support questions for 16 years and this has never come up. Until last week. For some reason, at May First we suddenly had about three percent of our email to Google deferred with the ominous sounding:

Our system has detected that this message is 421-4.7.0 suspicious due to the nature of the content and/or the links within.

The quantity of email that accounts for just three percent of mail to Google is high, and caused all kinds of monitoring alarms to go off, putting us into a bit of panic. Eventually we realized all but one of the email messages had bit.ly links. I m still not sure whether this issue was caused by a weird and coincidental spike in users sending bit.ly links to Google. Or whether some subtle change in the Google algorithm is responsible. Or some change in our IP address reputation placed greater emphasis on bit.ly links. In the end it doesn t really matter - the real point is that until we disrupt this growing monopoly we will all be at the mercy of Google and their algorithms for email deliverability (and much, much more).

13 May 2021

Shirish Agarwal: Population, Immigration, Vaccines and Mass-Surveilance.

The Population Issue and its many facets Another couple of weeks passed. A Lot of things happening, lots of anger and depression in folks due to handling in pandemic, but instead of blaming they are willing to blame everybody else including the population. Many of them want forced sterilization like what Sanjay Gandhi did during the Emergency (1975). I had to share So Long, My son . A very moving tale of two families of what happened to them during the one-child policy in China. I was so moved by it and couldn t believe that the Chinese censors allowed it to be produced, shot, edited, and then shared worldwide. It also won a couple of awards at the 69th Berlin Film Festival, silver bear for the best actor and the actress in that category. But more than the award, the theme, and the concept as well as the length of the movie which was astonishing. Over a 3 hr. something it paints a moving picture of love, loss, shame, relief, anger, and asking for forgiveness. All of which can be identified by any rational person with feelings worldwide.

Girl child What was also interesting though was what it couldn t or wasn t able to talk about and that is the Chinese leftover men. In fact, a similar situation exists here in India, only it has been suppressed. This has been more pronounced more in Asia than in other places. One big thing in this is human trafficking and mostly women trafficking. For the Chinese male, that was happening on a large scale from all neighboring countries including India. This has been shared in media and everybody knows about it and yet people are silent. But this is not limited to just the Chinese, even Indians have been doing it. Even yesteryear actress Rupa Ganguly was caught red-handed but then later let off after formal questioning as she is from the ruling party. So much for justice. What is and has been surprising at least for me is Rwanda which is in the top 10 of some of the best places in equal gender. It, along with other African countries have also been in news for putting quite a significant amount of percentage of GDP into public healthcare (between 20-10%), but that is a story for a bit later. People forget or want to forget that it was in Satara, a city in my own state where 220 girls changed their name from nakusha or unwanted to something else and that had become a piece of global news. One would think that after so many years, things would have changed, the only change that has happened is that now we have two ministries, The Ministry of Women and Child Development (MoWCD) and The Ministry of Health and Welfare (MoHFW). Sadly, in both cases, the ministries have been found wanting, Whether it was the high-profile Hathras case or even the routine cries of help which given by women on the twitter helpline. Sadly, neither of these ministries talks about POSH guidelines which came up after the 2012 gangrape case. For both these ministries, it should have been a pinned tweet. There is also the 1994 PCPNDT Act which although made in 1994, actually functioned in 2006, although what happens underground even today nobody knows . On the global stage, about a decade ago, Stephen J. Dubner and Steven Levitt argued in their book Freakonomics how legalized abortion both made the coming population explosion as well as expected crime rates to be reduced. There was a huge pushback on the same from the conservatives and has become a matter of debate, perhaps something that the Conservatives wanted. Interestingly, it hasn t made them go back but go forward as can be seen from the Freakonomics site.

Climate Change Another topic that came up for discussion was repeatedly climate change, but when I share Shell s own 1998 Confidential report titled Greenhouse effect all become strangely silent. The silence here is of two parts, there probably is a large swathe of Indians who haven t read the report and there may be a minority who have read it and know what already has been shared with U.S. Congress. The Conservative s argument has been for it is jobs and a weak we need to research more . There was a partial debunk of it on the TBD podcast by Matt Farell and his brother Sean Farell as to how quickly the energy companies are taking to the coming change.

Health Budget Before going to Covid stories. I first wanted to talk about Health Budgets. From the last 7 years the Center s allocation for health has been between 0.34 to 0.8% per year. That amount barely covers the salaries to the staff, let alone any money for equipment or anything else. And here by allocation I mean, what is actually spent, not the one that is shared by GOI as part of budget proposal. In fact, an article on Wire gives a good breakdown of the numbers. Even those who are on the path of free markets describe India s health business model as a flawed one. See the Bloomberg Quint story on that. Now let me come to Rwanda. Why did I chose Rwanda, I could have chosen South Africa where I went for Debconf 2016, I chose because Rwanda s story is that much more inspiring. In many ways much more inspiring than that South Africa in many ways. Here is a country which for decades had one war or the other, culminating into the Rwanda Civil War which ended in 1994. And coincidentally, they gained independence on a similar timeline as South Africa ending Apartheid in 1994. What does the country do, when it gains its independence, it first puts most of its resources in the healthcare sector. The first few years at 20% of GDP, later than at 10% of GDP till everybody has universal medical coverage. Coming back to the Bloomberg article I shared, the story does not go into the depth of beyond-expiry date medicines, spurious medicines and whatnot. Sadly, most media in India does not cover the deaths happening in rural areas and this I am talking about normal times. Today what is happening in rural areas is just pure madness. For last couple of days have been talking with people who are and have been covering rural areas. In many of those communities, there is vaccine hesitancy and why, because there have been whatsapp forwards sharing that if you go to a hospital you will die and your kidney or some other part of the body will be taken by the doctor. This does two things, it scares people into not going and getting vaccinated, at the same time they are prejudiced against science. This is politics of the lowest kind. And they do it so that they will be forced to go to temples or babas and what not and ask for solutions. And whether they work or not is immaterial, they get fixed and property and money is seized. Sadly, there are not many Indian movies of North which have tried to show it except for oh my god but even here it doesn t go the distance. A much more honest approach was done in Trance . I have never understood how the South Indian movies are able to do a more honest job of story-telling than what is done in Bollywood even though they do in 1/10th the budget that is needed in Bollywood. Although, have to say with OTT, some baggage has been shed but with the whole film certification rearing its ugly head through MEITY orders, it seems two steps backward instead of forward. The idea being simply to infantilize the citizens even more. That is a whole different ball-game which probably will require its own space.

Vaccine issues One good news though is that Vaccination has started. But it has been a long story full of greed by none other than GOI (Government of India) or the ruling party BJP. Where should I start with. I probably should start with this excellent article done by Priyanka Pulla. It is interesting and fascinating to know how vaccines are made, at least one way which she shared. She also shared about the Cutter Incident which happened in the late 50 s. The response was on expected lines, character assassination of her and the newspaper they published but could not critique any of the points made by her. Not a single point that she didn t think about x or y. Interestingly enough, in January 2021 Bharati Biotech was supposed to be share phase 3 trial data but hasn t been put up in public domain till May 2021. In fact, there have been a few threads raised by both well-meaning Indians as well as others globally especially on twitter to which GOI/ICMR (Indian Council of Medical Research) is silent. Another interesting point to note is that Russia did say in its press release that it is possible that their vaccine may not be standard (read inactivation on their vaccines and another way is possible but would take time, again Brazil has objected, but India hasn t till date.) What also has been interesting is the homegrown B.1.617 lineage or known as double mutant . This was first discovered from my own state, Maharashtra and then transported around the world. There is also B.1.618 which was found in West Bengal and is same or supposed to be similar to the one found in South Africa. This one is known as Triple mutant . About B.1.618 we don t know much other than knowing that it is much more easily transferable, much more infectious. Most countries have banned flights from India and I cannot fault them anyway. Hell, when even our diplomats do not care for procedures to be followed during the pandemic then how a common man is supposed to do. Of course, now for next month, Mr. Modi was supposed to go and now will not attend the G7 meeting. Whether, it is because he would have to face the press (the only Prime Minister and the only Indian Prime Minister who never has faced free press.) or because the Indian delegation has been disinvited, we would never know.

A good article which shares lots of lows with how things have been done in India has been an article by Arundhati Roy. And while the article in itself is excellent and shares a bit of the bitter truth but is still incomplete as so much has been happening. The problem is that the issue manifests in so many ways, it is difficult to hold on. As Arundhati shared, should we just look at figures and numbers and hold on, or should we look at individual ones, for e.g. the one shared in Outlook India. Or the one shared by Dr. Dipshika Ghosh who works in Covid ICU in some hospital
Dr. Dipika Ghosh sharing an incident in Covid Ward

Interestingly as well, while in the vaccine issue, Brazil Anvisa doesn t know what they are doing or the regulator just isn t knowledgeable etc. (statements by various people in GOI, when it comes to testing kits, the same is an approver.)
ICMR/DGCI approving internationally validated kits, Press release.

Twitter In the midst of all this, one thing that many people have forgotten and seem to have forgotten that Twitter and other tools are used by only the elite. The reason why the whole thing has become serious now than in the first phase is because the elite of India have also fallen sick and dying which was not the case so much in the first phase. The population on Twitter is estimated to be around 30-34 million and people who are everyday around 20 odd million or so, which is what 2% of the Indian population which is estimated to be around 1.34 billion. The other 98% don t even know that there is something like twitter on which you can ask help. Twitter itself is exclusionary in many ways, with both the emoticons, the language and all sorts of things. There is a small subset who does use Twitter in regional languages, but they are too small to write anything about. The main language is English which does become a hindrance to lot of people.

Censorship Censorship of Indians critical of Govt. mishandling has been non-stop. Even U.S. which usually doesn t interfere into India s internal politics was forced to make an exception. But of course, this has been on deaf ears. There is and was a good thread on Twitter by Gaurav Sabnis, a friend, fellow Puneite now settled in U.S. as a professor.
Gaurav on Trump-Biden on vaccination of their own citizens
Now just to surmise what has been happened in India and what has been happening in most of the countries around the world. Most of the countries have done centralization purchasing of the vaccine and then is distributed by the States, this is what we understand as co-operative federalism. While last year, GOI took a lot of money under the shady PM Cares fund for vaccine purchase, donations from well-meaning Indians as well as Industries and trade bodies. Then later, GOI said it would leave the states hanging and it is they who would have to buy vaccines from the manufacturers. This is again cheap politics. The idea behind it is simple, GOI knows that almost all the states are strapped for cash. This is not new news, this I have shared a couple of months back. The problem has been that for the last 6-8 months no GST meeting has taken place as shared by Punjab s Finance Minister Amarinder Singh. What will happen is that all the states will fight in-between themselves for the vaccine and most of them are now non-BJP Governments. The idea is let the states fight and somehow be on top. So, the pandemic, instead of being a public health issue has become something of on which politics has to played. The news on whatsapp by RW media is it s ok even if a million or two also die, as it is India is heavily populated. Although that argument vanishes for those who lose their dear and near ones. But that just isn t the issue, the issue goes much more deeper than that Oxygen:12%
Remedisivir:12%
Sanitiser:12%
Ventilator:12%
PPE:18%
Ambulances 28% Now all the products above are essential medical equipment and should be declared as essential medical equipment and should have price controls on which GST is levied. In times of pandemic, should the center be profiting on those. States want to let go and even want the center to let go so that some relief is there to the public, while at the same time make them as essential medical equipment with price controls. But GOI doesn t want to. Leaders of opposition parties wrote open letters but no effect. What is sad to me is how Ambulances are being taxed at 28%. Are they luxury items or sin goods ? This also reminds of the recent discovery shared by Mr. Pappu Yadav in Bihar. You can see the color of ambulances as shared by Mr. Yadav, and the same news being shared by India TV news showing other ambulances. Also, the weak argument being made of not having enough drivers. Ideally, you should have 2-3 people, both 9-1-1 and Chicago Fire show 2 people in ambulance but a few times they have also shown to be flipped over. European seems to have three people in ambulance, also they are also much more disciplined as drivers, at least an opinion shared by an American expat.
Pappu Yadav, President Jan Adhikar Party, Bihar May 11, 2021
What is also interesting to note is GOI plays this game of Health is State subject and health is Central subject depending on its convenience. Last year, when it invoked the Epidemic and DMA Act it was a Central subject, now when bodies are flowing down the Ganges and pyres being lit everywhere, it becomes a State subject. But when and where money is involved, it again becomes a Central subject. The States are also understanding it, but they are fighting on too many fronts.
Snippets from Karnataka High Court hearing today, 13th March 2021
One of the good things is most of the High Courts have woken up. Many of the people on the RW think that the Courts are doing Judicial activism . And while there may be an iota of truth in it, the bitter truth is that many judges or relatives or their helpers have diagnosed and some have even died due to Covid. In face of the inevitable, what can they do. They are hauling up local Governments to make sure they are accountable while at the same time making sure that they get access to medical facilities. And I as a citizen don t see any wrong in that even if they are doing it for selfish reasons. Because, even if justice is being done for selfish reasons, if it does improve medical delivery systems for the masses, it is cool. If it means that the poor and everybody else are able to get vaccinations, oxygen and whatever they need, it is cool. Of course, we are still seeing reports of patients spending in the region of INR 50k and more for each day spent in hospital. But as there are no price controls, judges cannot do anything unless they want to make an enemy of the medical lobby in the country. A good story on medicines and what happens in rural areas, see no further than Laakhon mein ek.
Allahabad High Court hauling Uttar Pradesh Govt. for lack of Oxygen is equal to genocide, May 11, 2021
The censorship is not just related to takedown requests on twitter but nowadays also any articles which are critical of the GOI s handling. I have been seeing many articles which have shared facts and have been critical of GOI being taken down. Previously, we used to see 404 errors happen 7-10 years down the line and that was reasonable. Now we see that happen, days weeks or months. India seems to be turning more into China and North Korea and become more anti-science day-by-day

Fake websites Before going into fake websites, let me start with a fake newspaper which was started by none other than the Gujarat CM Mr. Modi in 2005 .
Gujarat Satya Samachar 2005 launched by Mr. Modi.
And if this wasn t enough than on Feb 8, 2005, he had invoked Official Secrets Act
Mr. Modi invoking Official Secrets Act, Feb 8 2005 Gujarat Samachar
The headlines were In Modi s regime press freedom is in peril-Down with Modi s dictatorship. So this was a tried and tested technique. The above information was shared by Mr. Urvish Kothari, who incidentally also has his own youtube channel. Now cut to 2021, and we have a slew of fake websites being done by the same party. In fact, it seems they started this right from 2011. A good article on BBC itself tells the story. Hell, Disinfo.eu which basically combats disinformation in EU has a whole pdf chronicling how BJP has been doing it. Some of the sites it shared are

Times of New York
Manchester Times
Times of Los Angeles
Manhattan Post
Washington Herald
and many more. The idea being take any site name which sounds similar to a brand name recognized by Indians and make fool of them. Of course, those of who use whois and other such tools can easily know what is happening. Two more were added to the list yesterday, Daily Guardian and Australia Today. There are of course, many features which tell them apart from genuine websites. Most of these are on shared hosting rather than dedicated hosting, most of these are bought either from Godaddy and Bluehost. While Bluehost used to be a class act once upon a time, both the above will do anything as long as they get money. Don t care whether it s a fake website or true. Capitalism at its finest or worst depending upon how you look at it. But most of these details are lost on people who do not know web servers, at all and instead think see it is from an exotic site, a foreign site and it chooses to have same ideas as me. Those who are corrupt or see politics as a tool to win at any cost will not see it as evil. And as a gentleman Raghav shared with me, it is so easy to fool us. An example he shared which I had forgotten. Peter England which used to be an Irish brand was bought by Aditya Birla group way back in 2000. But even today, when you go for Peter England, the way the packaging is done, the way the prices are, more often than not, people believe they are buying the Irish brand. While sharing this, there is so much of Naom Chomsky which comes to my mind again and again

Caste Issues I had written about caste issues a few times on this blog. This again came to the fore as news came that a Hindu sect used forced labor from Dalit community to make a temple. This was also shared by the hill. In both, Mr. Joshi doesn t tell that if they were volunteers then why their passports have been taken forcibly, also I looked at both minimum wage prevailing in New Jersey as a state as well as wage given to those who are in the construction Industry. Even in minimum wage, they were giving $1 when the prevailing minimum wage for unskilled work is $12.00 and as Mr. Joshi shared that they are specialized artisans, then they should be paid between $23 $30 per hour. If this isn t exploitation, then I don t know what is. And this is not the first instance, the first instance was perhaps the case against Cisco which was done by John Doe. While I had been busy with other things, it seems Cisco had put up both a demurrer petition and a petition to strike which the Court stayed. This seemed to all over again a type of apartheid practice, only this time applied to caste. The good thing is that the court stayed the petition. Dr. Ambedkar s statement if Hindus migrate to other regions on earth, Indian caste would become a world problem given at Columbia University in 1916, seems to be proven right in today s time and sadly has aged well. But this is not just something which is there only in U.S. this is there in India even today, just couple of days back, a popular actress Munmun Dutta used a casteist slur and then later apologized giving the excuse that she didn t know Hindi. And this is patently false as she has been in the Bollywood industry for almost now 16-17 years. This again, was not an isolated incident. Seema Singh, a lecturer in IIT-Kharagpur abused students from SC, ST backgrounds and was later suspended. There is an SC/ST Atrocities Act but that has been diluted by this Govt. A bit on the background of Dr. Ambedkar can be found at a blog on Columbia website. As I have shared and asked before, how do we think, for what reason the Age of Englightenment or the Age of Reason happened. If I were a fat monk or a priest who was privileges, would I have let Age of Enlightenment happen. It broke religion or rather Church which was most powerful to not so powerful and that power was more distributed among all sort of thinkers, philosophers, tinkers, inventors and so on and so forth.

Situation going forward I believe things are going to be far more complex and deadly before they get better. I had to share another term called Comorbidities which fortunately or unfortunately has also become part of twitter lexicon. While I have shared what it means, it simply means when you have an existing ailment or condition and then Coronavirus attacks you. The Virus will weaken you. The Vaccine in the best case just stops the damage, but the damage already done can t be reversed. There are people who advise and people who are taking steroids but that again has its own side-effects. And this is now, when we are in summer. I am afraid for those who have recovered, what will happen to them during the Monsoons. We know that the Virus attacks most the lungs and their quality of life will be affected. Even the immune system may have issues. We also know about the inflammation. And the grant that has been given to University of Dundee also has signs of worry, both for people like me (obese) as well as those who have heart issues already. In other news, my city which has been under partial lockdown since a month, has been extended for another couple of weeks. There are rumors that the same may continue till the year-end even if it means economics goes out of the window.There is possibility that in the next few months something like 2 million odd Indians could die

The above is a conversation between Karan Thapar and an Oxford Mathematician Dr. Murad Banaji who has shared that the under-counting of cases in India is huge. Even BBC shared an article on the scope of under-counting. Of course, those on the RW call of the evidence including the deaths and obituaries in newspapers as a narrative . And when asked that when deaths used to be in the 20 s or 30 s which has jumped to 200-300 deaths and this is just the middle class and above. The poor don t have the money to get wood and that is the reason you are seeing the bodies in Ganges whether in Buxar Bihar or Gajipur, Uttar Pradesh. The sights and visuals makes for sorry reading
Pandit Ranjan Mishra son on his father s death due to unavailability of oxygen, Varanasi, Uttar Pradesh, 11th May 2021.
For those who don t know Pandit Ranjan Mishra was a renowned classical singer. More importantly, he was the first person to suggest Mr. Modi s name as a Prime Ministerial Candidate. If they couldn t fulfil his oxygen needs, then what can be expected for the normal public.

Conclusion Sadly, this time I have no humorous piece to share, I can however share a documentary which was shared on Feluda . I have shared about Feluda or Prodosh Chandra Mitter a few times on this blog. He has been the answer of James Bond from India. I have shared previously about The Golden Fortress . An amazing piece of art by Satyajit Ray. I watched that documentary two-three times. I thought, mistakenly that I am the only fool or fan of Feluda in Pune to find out that there are people who are even more than me. There were so many facets both about Feluda and master craftsman Satyajit Ray that I was unaware about. I was just simply amazed. I even shared few of the tidbits with mum as well, although now she has been truly hooked to Korean dramas. The only solace from all the surrounding madness. So, if you have nothing to do, you can look up his books, read them and then see the movies. And my first recommendation would be the Golden Fortress. The only thing I would say, do not have high hopes. The movie is beautiful. It starts slow and then picks up speed, just like a train. So, till later. Update The Mass surveillance part I could not do justice do hence removed it at the last moment. It actually needs its whole space, article. There is so much that the Govt. is doing under the guise of the pandemic that it is difficult to share it all in one article. As it is, the article is big

26 April 2021

Vishal Gupta: Ramblings // On Sikkim and Backpacking

What I loved the most about Sikkim can t be captured on cameras. It can t be taped since it would be intrusive and it can t be replicated because it s unique and impromptu. It could be described, as I attempt to, but more importantly, it s something that you simply have to experience to know. Now I first heard about this from a friend who claimed he d been offered free rides and Tropicanas by locals after finishing the Ladakh Marathon. And then I found Ronnie s song, whose chorus goes : Dil hai pahadi, thoda anadi. Par duniya ke maya mein phasta nahi (My heart belongs to the mountains. Although a little childish, it doesn t get hindered by materialism). While the song refers his life in Manali, I think this holds true for most Himalayan states. Maybe it s the pleasant weather, the proximity to nature, the sense of safety from Indian Army being round the corner, independence from material pleasures that aren t available in remote areas or the absence of the pollution, commercialisation, & cutthroat-ness of cities, I don t know, there s just something that makes people in the mountains a lot kinder, more generous, more open and just more alive. Sikkimese people, are honestly some of the nicest people I ve ever met. The blend of Lepchas, Bhutias and the humility and the truthfulness Buddhism ingrains in its disciples is one that ll make you fall in love with Sikkim (assuming the views, the snow, the fab weather and food, leave you pining for more). As a product of Indian parenting, I ve always been taught to be wary of the unknown and to stick to the safer, more-travelled path but to be honest, I enjoy bonding with strangers. To me, each person is a storybook waiting to be flipped open with the right questions and the further I get from home, the wilder the stories get. Besides there s something oddly magical about two arbitrary curvilinear lines briefly running parallel until they diverge to move on to their respective paths. And I think our society has been so busy drawing lines and spreading hate that we forget that in the end, we re all just lines on the universe s infinite canvas. So the next time you travel, and you re in a taxi, a hostel, a bar, a supermarket, or on a long walk to a monastery (that you re secretly wishing is open despite a lockdown), strike up a conversation with a stranger. Small-talk can go a long way.

Header icon made by Freepik from www.flaticon.com

11 April 2021

Vishal Gupta: Sikkim 101 for Backpackers

Host to Kanchenjunga, the world s third-highest mountain peak and the endangered Red Panda, Sikkim is a state in northeastern India. Nestled between Nepal, Tibet (China), Bhutan and West Bengal (India), the state offers a smorgasbord of cultures and cuisines. That said, it s hardly surprising that the old spice route meanders through western Sikkim, connecting Lhasa with the ports of Bengal. Although the latter could also be attributed to cardamom (kali elaichi), a perennial herb native to Sikkim, which the state is the second-largest producer of, globally. Lastly, having been to and lived in India, all my life, I can confidently say Sikkim is one of the cleanest & safest regions in India, making it ideal for first-time backpackers.

Brief History

17th century: The Kingdom of Sikkim is founded by the Namgyal dynasty and ruled by Buddhist priest-kings known as the Chogyal.

1890: Sikkim becomes a princely state of British India.

1947: Sikkim continues its protectorate status with the Union of India, post-Indian-independence.

1973: Anti-royalist riots take place in front of the Chogyal's palace, by Nepalis seeking greater representation.

1975: Referendum leads to the deposition of the monarchy and Sikkim joins India as its 22nd state.

Languages

Official: English, Nepali, Sikkimese/Bhotia and Lepcha

Though Hindi and Nepali share the same script (Devanagari), they are not mutually intelligible. Yet, most people in Sikkim can understand and speak Hindi.

Ethnicity

Nepalis: Migrated in large numbers (from Nepal) and soon became the dominant community

Bhutias: People of Tibetan origin. Major inhabitants in Northern Sikkim.

Lepchas: Original inhabitants of Sikkim

Food

Tibetan/Nepali dishes (mostly consumed during winter)
- Thukpa: Noodle soup, rich in spices and vegetables. Usually contains some form of meat. Common variations: Thenthuk and Gyathuk
- Momos: Steamed or fried dumplings, usually with a meat filling.
- Saadheko: Spicy marinated chicken salad.
- Gundruk Soup: A soup made from Gundruk, a fermented leafy green vegetable.
- Sinki : A fermented radish tap-root product, traditionally consumed as a base for soup and as a pickle. Eerily similar to Kimchi.
While pork and beef are pretty common, finding vegetarian dishes is equally easy.
Staple: Dal-Bhat with Subzi. Rice is a lot more common than wheat (rice) possibly due to greater carb content and proximity to West Bengal, India s largest producer of Rice.
Good places to eat in Gangtok
- Hamro Bhansa Ghar, Nimtho (Nepali)
- Taste of Tibet
- Dragon Wok (Chinese & Japanese)

Buddhism in Sikkim

Bayul Demojong (Sikkim), is the most sacred Land in the Himalayas as per the belief of the Northern Buddhists and various religious texts.

Sikkim was blessed by Guru Padmasambhava, the great Buddhist saint who visited Sikkim in the 8th century and consecrated the land.

However, Buddhism is said to have reached Sikkim only in the 17th century with the arrival of three Tibetan monks viz. Rigdzin Goedki Demthruchen, Mon Kathok Sonam Gyaltshen & Rigdzin Legden Je at Yuksom. Together, they established a Buddhist monastery.

In 1642 they crowned Phuntsog Namgyal as the first monarch of Sikkim and gave him the title of Chogyal, or Dharma Raja.

The faith became popular through its royal patronage and soon many villages had their own monastery.

Today Sikkim has over 200 monasteries.

Major monasteries

Rumtek Monastery, 20Km from Gangtok

Lingdum/Ranka Monastery, 17Km from Gangtok

Phodong Monastery, 28Km from Gangtok

Ralang Monastery, 10Km from Ravangla

Tsuklakhang Monastery, Royal Palace, Gangtok

Enchey Monastery, Gangtok

Tashiding Monastery, 35Km from Ravangla

Reaching Sikkim

Gangtok, being the capital, is easiest to reach amongst other regions, by public transport and shared cabs.

By Air:

Pakyong (PYG) :

Nearest airport from Gangtok (about 1 hour away)

Tabletop airport

Reserved cabs cost around INR 1200.

As of Apr 2021, the only flights to PYG are from IGI (Delhi) and CCU (Kolkata).

Bagdogra (IXB) :

About 20 minutes from Siliguri and 4 hours from Gangtok.

Larger airport with flights to most major Indian cities.

Reserved cabs cost about INR 3000. Shared cabs cost about INR 350.

By Train:

New Jalpaiguri (NJP) :

About 20 minutes from Siliguri and 4 hours from Gangtok.

Reserved cabs cost about INR 3000. Shared cabs from INR 350.

By Road:

NH10 connects Siliguri to Gangtok

If you can t find buses plying to Gangtok directly, reach Siliguri and then take a cab to Gangtok.

Sikkim Nationalised Transport Div. also runs hourly buses between Siliguri and Gangtok and daily buses on other common routes. They re cheaper than shared cabs.

Wizzride also operates shared cabs between Siliguri/Bagdogra/NJP, Gangtok and Darjeeling. They cost about the same as shared cabs but pack in half as many people in luxury cars (Innova, Xylo, etc.) and are hence more comfortable.

Gangtok

Time needed: 1D/1N

Places to visit:

Hanuman Tok

Ganesh Tok

Tashi View Point [6,800ft]

MG Marg

Sikkim Zoo

Gangtok Ropeway

Enchey Monastery

Tsuklakhang Palace & Monastery

Hostels: Tagalong Backpackers (would strongly recommend), Zostel Gangtok

Places to chill: Travel Cafe, Caf Live & Loud and Gangtok Groove

Places to shop: Lal Market and MG Marg

Getting Around

Taxis operate on a reserved or shared basis. In case of the latter, you can pool with other commuters your taxis will pick up and drop en-route.

Naturally shared taxis only operate on popular routes. The easiest way to get around Gangtok is to catch a shared cab from MG Marg.

Reserved taxis for Gangtok sightseeing cost around INR 1000-1500, depending upon the spots you d like to see

Key taxi/bus stands :

Deorali stand: For Darjeeling, Siliguri, Kalimpong

Vajra stand: For North & East Sikkim (Tsomgo Lake & Nathula)

Rumtek taxi: For Ravangla, Pelling, Namchi, Geyzing, Jorethang and Singtam.

Exploring Gangtok on an MTB

If you'd like to explore Gangtok at your own pace, you could rent a cycle from Hub Outdoor or HillBike. I paid INR 600 for a day at HillBike.

These are well-maintained geared mountain bikes, so adapting to inclines and declines should be a breeze, even for beginners.

Cycling down from Ganesh Tok to Tashi View Point is an absolute delight and in my opinion, the best thing you can do in Gangtok.

While it is a little unconventional and exhausting (uphill stretches), but a 100% recommended unless you ve been skipping leg day

Pro tip : Make sure you use the right/back brake while decelerating downhill and not the left/front brake.

Recommended route [3-5hrs, 20.3Km, 1770-2081ft] : Chandmari (HillBike) -> Sera Jhe Dro-Phen-Ling Monastery > Ganesh Tok > Tashi View Point > Bakthung Falls > Vajra Stand > Tsuklakhang Palace & Monastery > Enchey Monastery > Chandmari

North Sikkim

The easiest & most economical way to explore North Sikkim is the 3D/2N package offered by shared-cab drivers.

This includes food, permits, cab rides and accommodation (1N in Lachen and 1N in Lachung)

The accommodation on both nights are at homestays with bare necessities, so keep your hopes low.

In the spirit of sustainable tourism, you ll be asked to discard single-use plastic bottles, so please carry a bottle that you can refill along the way.

Zero Point and Gurdongmer Lake are snow-capped throughout the year

3D/2N Shared-cab Package Itinerary

Day 1

Gangtok (10am) - Chungthang - Lachung (stay)

Day 2

Pre-lunch : Lachung (6am) - Yumthang Valley [12,139ft] - Zero Point - Lachung [15,300ft]

Post-lunch : Lachung - Chungthang - Lachen (stay)

Day 3

Pre-lunch : Lachen (5am) - Kala Patthar - Gurdongmer Lake [16,910ft] - Lachen

Post-lunch : Lachen - Chungthang - Gangtok (7pm)

This itinerary is idealistic and depends on the level of snowfall.

Some drivers might switch up Day 2 and 3 itineraries by visiting Lachen and then Lachung, depending upon the weather.

Areas beyond Lachen & Lachung are heavily militarized since the Indo-China border is only a few miles away.

East Sikkim

Zuluk and Silk Route

Time needed: 2D/1N

Zuluk [9,400ft] is a small hamlet with an excellent view of the eastern Himalayan range including the Kanchenjunga.

Was once a transit point to the historic Silk Route from Tibet (Lhasa) to India (West Bengal).

The drive from Gangtok to Zuluk takes at least four hours. Hence, it makes sense to spend the night at a homestay and space out your trip to Zuluk

Tsomgo Lake and Nathula

Time Needed : 1D

A Protected Area Permit is required to visit these places, due to their proximity to the Chinese border

Tsomgo/Chhangu Lake [12,313ft]

Glacial lake, 40 km from Gangtok.

Remains frozen during the winter season.

You can also ride on the back of a Yak for INR 300

Baba Mandir

An old temple dedicated to Baba Harbhajan Singh, a Sepoy in the 23rd Regiment, who died in 1962 near the Nathu La during Indo China war.

Nathula Pass [14,450ft]

Located on the Indo-Tibetan border crossing of the Old Silk Route, it is one of the three open trading posts between India and China.

Plays a key role in the Sino-Indian Trade and also serves as an official Border Personnel Meeting(BPM) Point.

May get cordoned off by the Indian Army in event of heavy snowfall or for other security reasons.

West Sikkim

Time needed: 3N/1N

Hostels at Pelling : Mochilerro Ostillo

Itinerary

Day 1: Gangtok - Ravangla - Pelling

Leave Gangtok early, for Ravangla through the Temi Tea Estate route.

Spend some time at the tea garden and then visit Buddha Park at Ravangla

Head to Pelling from Ravangla

Day 2: Pelling sightseeing

Hire a cab and visit Skywalk, Pemayangtse Monastery, Rabdentse Ruins, Kecheopalri Lake, Kanchenjunga Falls.

Day 3: Pelling - Gangtok/Siliguri

Wake up early to catch a glimpse of Kanchenjunga at the Pelling Helipad around sunrise

Head back to Gangtok on a shared-cab

You could take a bus/taxi back to Siliguri if Pelling is your last stop.

Darjeeling

In my opinion, Darjeeling is lovely for a two-day detour on your way back to Bagdogra/Siliguri and not any longer (unless you re a Bengali couple on a honeymoon)

Once a part of Sikkim, Darjeeling was ceded to the East India Company after a series of wars, with Sikkim briefly receiving a grant from EIC for gifting Darjeeling to the latter

Post-independence, Darjeeling was merged with the state of West Bengal.

Itinerary

Day 1 :

Take a cab from Gangtok to Darjeeling (shared-cabs cost INR 300 per seat)

Reach Darjeeling by noon and check in to your Hostel. I stayed at Hideout.

Spend the evening visiting either a monastery (or the Batasia Loop), Nehru Road and Mall Road.

Grab dinner at Glenary whilst listening to live music.

Day 2:

Wake up early to catch the sunrise and a glimpse of Kanchenjunga at Tiger Hill. Since Tiger Hill is 10km from Darjeeling and requires a permit, book your taxi in advance.

Alternatively, if you don t want to get up at 4am or shell out INR1500 on the cab to Tiger Hill, walk to the Kanchenjunga View Point down Mall Road

Next, queue up outside Keventers for breakfast with a view in a century-old cafe

Get a cab at Gandhi Road and visit a tea garden (Happy Valley is the closest) and the Ropeway. I was lucky to meet 6 other backpackers at my hostel and we ended up pooling the cab at INR 200 per person, with INR 1400 being on the expensive side, but you could bargain.

Get lunch, buy some tea at Golden Tips, pack your bags and hop on a shared-cab back to Siliguri. It took us about 4hrs to reach Siliguri, with an hour to spare before my train.

If you ve still got time on your hands, then check out the Peace Pagoda and the Darjeeling Himalayan Railway (Toy Train). At INR 1500, I found the latter to be too expensive and skipped it.

Tips and hacks

Download offline maps, especially when you re exploring Northern Sikkim.

Food and booze are the cheapest in Gangtok. Stash up before heading to other regions.

Keep your Aadhar/Passport handy since you need permits to travel to North & East Sikkim.

In rural areas and some cafes, you may get to try Rhododendron Wine, made from Rhododendron arboreum a.k.a Gurans. Its production is a little hush-hush since the flower is considered holy and is also the National Flower of Nepal.

If you don t want to invest in a new jacket, boots or a pair of gloves, you can always rent them at nominal rates from your hotel or little stores around tourist sites.

Check the weather of a region before heading there. Low visibility and precipitation can quite literally dampen your experience.

Keep your itinerary flexible to accommodate for rest and impromptu plans.

Shops and restaurants close by 8pm in Sikkim and Darjeeling. Plan for the same.

Carry

a couple of extra pairs of socks (woollen, if possible)

a pair of slippers to wear indoors

a reusable water bottle

an umbrella

a power bank

a couple of tablets of Diamox. Helps deal with altitude sickness

extra clothes and wet bags since you may not get a chance to wash/dry your clothes

a few passport size photographs

Shared-cab hacks

Intercity rides can be exhausting. If you can afford it, pay for an additional seat.

Call shotgun on the drives beyond Lachen and Lachung. The views are breathtaking.

Return cabs tend to be cheaper (WB cabs travelling from SK and vice-versa)

Cost

My median daily expenditure (back when I went to Sikkim in early March 2021) was INR 1350.

This includes stay (bunk bed), food, wine and transit (shared cabs)

In my defence, I splurged on food, wine and extra seats in shared cabs, but if you re on a budget, you could easily get by on INR 1 - 1.2k per day.

For a 9-day trip, I ended up shelling out nearly INR 15k, including 2AC trains to & from Kolkata

Note : Summer (March to May) and Autumn (October to December) are peak seasons, and thereby more expensive to travel around.

Souvenirs and things you should buy

Buddhist souvenirs :

Colourful Prayer Flags (great for tying on bikes or behind car windshields)

Miniature Prayer/Mani Wheels

Lucky Charms, Pendants and Key Chains

Cham Dance masks and robes

Singing Bowls

Common symbols: Om mani padme hum, Ashtamangala, Zodiac signs

Handicrafts & Handlooms

Tibetan Yak Wool shawls, scarfs and carpets

Sikkimese Ceramic cups

Thangka Paintings

Edibles

Darjeeling Tea (usually brewed and not boiled)

Wine (Arucha Peach & Rhododendron)

Dalle Khursani (Chilli) Paste and Pickle

Header Icon made by Freepik from www.flaticon.com is licensed by CC 3.0 BY

6 March 2021

Michael Stapelberg: Debian Code Search: OpenAPI now available

Debian Code Search now offers an OpenAPI-based API! Various developers have created ad-hoc client libraries based on how the web interface works. The goal of offering an OpenAPI-based API is to provide developers with automatically generated client libraries for a large number of programming languages, that target a stable interface independent of the web interface s implementation details.

Getting started

Visit https://codesearch.debian.net/apikeys/ to download your personal API key. Login via Debian s GitLab instance salsa.debian.org; register there if you have no account yet.

Find the Debian Code Search client library for your programming language. If none exists yet, auto-generate a client library on editor.swagger.io: click Generate Client .

Search all code in Debian from your own analysis tool, migration tracking dashboard, etc.

curl example

curl \
  -H "x-dcs-apikey: $(cat dcs-apikey-stapelberg.txt)" \
  -X GET \
  "https://codesearch.debian.net/api/v1/search?query=i3Font&match_mode=regexp"

Web browser example You can try out the API in your web browser in the OpenAPI documentation.

Code example (Go) Here s an example program that demonstrates how to set up an auto-generated Go client for the Debian Code Search OpenAPI, run a query, and aggregate the results:

func burndown() error  
	cfg := openapiclient.NewConfiguration()
	cfg.AddDefaultHeader("x-dcs-apikey", apiKey)
	client := openapiclient.NewAPIClient(cfg)
	ctx := context.Background()
	// Search through the full Debian Code Search corpus, blocking until all
	// results are available:
	results, _, err := client.SearchApi.Search(ctx, "fmt.Sprint(err)", &openapiclient.SearchApiSearchOpts 
		// Literal searches are faster and do not require escaping special
		// characters, regular expression searches are more powerful.
		MatchMode: optional.NewString("literal"),
	 )
	if err != nil  
		return err
	 
	// Print to stdout a CSV file with the path and number of occurrences:
	wr := csv.NewWriter(os.Stdout)
	header := []string "path", "number of occurrences" 
	if err := wr.Write(header); err != nil  
		return err
	 
	occurrences := make(map[string]int)
	for _, result := range results  
		occurrences[result.Path]++
	 
	for _, result := range results  
		o, ok := occurrences[result.Path]
		if !ok  
			continue
		 
		// Print one CSV record per path:
		delete(occurrences, result.Path)
		record := []string result.Path, strconv.Itoa(o) 
		if err := wr.Write(record); err != nil  
			return err
		 
	 
	wr.Flush()
	return wr.Error()

The full example can be found under burndown.go.

Feedback? File a GitHub issue on `github.com/Debian/dcs` please!

Migration status I m aware of the following third-party projects using Debian Code Search:

Tool Migration status

Debian Code Search CLI tool Updated to OpenAPI

identify-incomplete-xs-go-import-path Update pending

gnome-codesearch makes no API queries

If you find any others, please point them to this post in case they are not using Debian Code Search s OpenAPI yet.

Tool	Migration status
Debian Code Search CLI tool	Updated to OpenAPI
identify-incomplete-xs-go-import-path	Update pending
gnome-codesearch	makes no API queries

23 September 2020

Vincent Fourmond: Tutorial: analyze Km data of CODHs

This is the first post of a series in which we will provide the readers with simple tutorial approaches to reproduce the data analysis of some of our published papers. All our data analysis is performed using QSoas. Today, we will show you how to analyze the experiments we used to characterize the behaviour of an enzyme, the Nickel-Iron CO dehydrogenase IV from Carboxytothermus hydrogenoformans. The experiments we analyze here are described in much more details in the original publication, Domnik et al, Angewandte Chemie, 2017. The only things you need to know for now are the following:

the data is the evolution of the current over time given by the absorbed enzyme;
at a given point, a certain amount of CO (50 M) is injected, and its concentration decreases exponentially over time;
the enzyme has a Michaelis-Menten behaviour with respect to CO.

This means that we expect a response of the type: $$i(t) = \frac i_m 1 + \frac K_m [\mathrm CO ](t) $$ in which $$[\mathrm CO ](t) = \begin cases 0, & \text for t < t_0 \\ C_0 \exp \frac t_0 - t \tau , & \text for t\geq t_0 %> \end cases $$ To begin this tutorial, first download the files from the github repository (direct links: data, parameter file and ruby script). Start QSoas, go to the directory where you saved the files, load the data file, and remove spikes in the data using the following commands:

QSoas> cd
QSoas> l Km-CODH-IV.dat
QSoas> R

First fitThen, to fit the above equation to the data, the simplest is to take advantage of the time-dependent parameters features of QSoas. Run simply:

QSoas> fit-arb im/(1+km/s) /with=s:1,exp

This simply launches the fit interface to fit the exact equations above. The im/(1+km/s) is simply the translation of the Michaelis-Menten equation above, and the /with=s:1,exp specifies that s is the result of the sum of 1 exponential like for the definition of above. Then, load the Km-CODH-IV.params parameter files (using the "Parameters.../Load from file" action at the bottom, or the Ctrl+L keyboard shortcut). Your window should now look like this:

To fit the data, just hit the "Fit" button ! (or Ctrl+F). Including an offset The fit is not bad, but not perfect. In particular, it is easy to see why: the current predicted by the fit goes to 0 at large times, but the actual current is below 0. We need therefore to include an offset to take this into consideration. Close the fit window, and re-run a fit, but now with this command:

QSoas> fit-arb im/(1+km/s)+io /with=s:1,exp

Notice the +io bit that corresponds to the addition of an offset current. Load again the base parameters, run the fit again... Your fit window show now look like:

See how the offset current is now much better taken into account. Let's talk a bit more about the parameters:

im is $i_m$, the maximal current, around 120 nA (which matches the magnitude of the original current).
io is the offset current, around -3nA.
km is the $K_m$, expressed in the same units as s_1, the first "injected" value of s (we used 50 because the injection is 50 M CO). That means the value of $K_m$ determined by this fit is around 9 nM ! (but see below).
s_t_1 is the time of the injection of CO (it was in the parameter files you loaded).
Finally, s_tau_1 is the time constant of departure of CO, noted $\tau$ in the equations above.

Taking into account mass-transport limitations However, the fit is still unsatisfactory: the predicted curve fails to reproduce the curvature at the beginning and at the end of the decrease. This is due to issues linked to mass-transport limitations, which are discussed in details in Merrouch et al, Electrochimica Acta, 2017. In short, what you need to do is to close the fit window again, load the transport.rb Ruby file that contains the definition of the itrpt function, and re-launch the fit window using:

QSoas> ruby-run transport.rb
QSoas> fit-arb itrprt(s,km,nFAm,nFAmu)+io /with=s:1,exp

Load again the parameter file... but this time you'll have to play a bit more with the starting parameters for QSoas to find the right values when you fit. Here are some tips:

the curve predicted with the current parameters (use "Update" to update the display) should "look like" the data;
apart from io, no parameter should be negative;
there may be hints about the correct values in the papers...

A successful fit should look like this:

Here you are ! I hope you enjoyed analyzing our data, and that it will help you analyze yours ! Feel free to comment and ask for clarifications.

About QSoasQSoas is a powerful open source data analysis program that focuses on flexibility and powerful fitting capacities. It is released under the GNU General Public License. It is described in Fourmond, Anal. Chem., 2016, 88 (10), pp 5050 5052. Current version is 2.2. You can download its source code or buy precompiled versions for MacOS and Windows there.

18 September 2020

Sven Hoexter: Avoiding the GitHub WebUI

Now that GitHub released v1.0 of the gh cli tool, and this is all over HN, it might make sense to write a note about my clumsy aliases and shell functions I cobbled together in the past month. Background story is that my dayjob moved to GitHub coming from Bitbucket. From my point of view the WebUI for Bitbucket is mediocre, but the one at GitHub is just awful and painful to use, especially for PR processing. So I longed for the terminal and ended up with gh and wtfutil as a dashboard. The setup we have is painful on its own, with several orgs and repos which are more like monorepos covering several corners of infrastructure, and some which are very focused on a single component. All workflows are anti GitHub workflows, so you must have permission on the repo, create a branch in that repo as a feature branch, and open a PR for the merge back into master. gh functions and aliases

# setup a token with perms to everything, dealing with SAML is a PITA
export GITHUB_TOKEN="c0ffee4711"
# I use a light theme on my terminal, so adjust the gh theme
export GLAMOUR_STYLE="light"
#simple aliases to poke at a PR
alias gha="gh pr review --approve"
alias ghv="gh pr view"
alias ghd="gh pr diff"
### github support functions, most invoked with a PR ID as $1
#primary function to review PRs
function ghs  
    gh pr view $ 1 
    gh pr checks $ 1 
    gh pr diff $ 1 
 
# very custom PR create function relying on ORG and TEAM settings hard coded
# main idea is to create the PR with my team directly assigned as reviewer
function ghc  
    if git status   grep -q 'Untracked'; then
        echo "ERROR: untracked files in branch"
        git status
        return 1
    fi
    git push --set-upstream origin HEAD
    gh pr create -f -r "$(git remote -v   grep push   grep -oE 'myorg-[a-z]+')/myteam"
 
# merge a PR and update master if we're not in a different branch
function ghm  
    gh pr merge -d -r $ 1 
    if [[ "$(git rev-parse --abbrev-ref HEAD)" == "master" ]]; then
        git pull
    fi
 
# get an overview over the files changed in a PR
function ghf  
    gh pr diff $ 1    diffstat -l
 
# generate a link to a commit in the WebUI to pass on to someone else
# input is a git commit hash
function ghlink  
    local repo="$(git remote -v   grep -E "github.+push"   cut -d':' -f 2   cut -d'.' -f 1)"
    echo "https://github.com/$ repo /commit/$ 1 "

wtfutil I have a terminal covering half my screensize with small dashboards listing PRs for the repos I care about. For other repos I reverted back to mail notifications which get sorted and processed from time to time. A sample dashboard config looks like this:

github_admin:
  apiKey: "c0ffee4711"
  baseURL: ""
  customQueries:
    othersPRs:
      title: "Pull Requests"
      filter: "is:open is:pr -author:hoexter -label:dependencies"
  enabled: true
  enableStatus: true
  showOpenReviewRequests: false
  showStats: false
  position:
    top: 0
    left: 0
    height: 3
    width: 1
  refreshInterval: 30
  repositories:
    - "myorg/admin"
  uploadURL: ""
  username: "hoexter"
  type: github

The -label:dependencies is used here to filter out dependabot PRs in the dashboard. Workflow Look at a PR with ghv $ID, if it's ok ACK it with gha $ID. Create a PR from a feature branch with ghc and later on merge it with ghm $ID. The $ID is retrieved from looking at my wtfutil based dashboard. Security Considerations The world is full of bad jokes. For the WebUI access I've the full array of pain with SAML auth, which expires too often, and 2nd factor verification for my account backed by a Yubikey. But to work with the CLI you basically need an API token with full access, everything else drives you insane. So I gave in and generated exactly that. End result is that I now have an API token - which is basically a password - which has full power, and is stored in config files and environment variables. So the security features created around the login are all void now. Was that the aim of it after all?

17 May 2020

Matthew Palmer: Private Key Redaction: UR DOIN IT RONG

Because posting private keys on the Internet is a bad idea, some people like to redact their private keys, so that it looks kinda-sorta like a private key, but it isn t actually giving away anything secret. Unfortunately, due to the way that private keys are represented, it is easy to redact a key in such a way that it doesn t actually redact anything at all. RSA private keys are particularly bad at this, but the problem can (potentially) apply to other keys as well. I ll show you a bit of Inside Baseball with key formats, and then demonstrate the practical implications. Finally, we ll go through a practical worked example from an actual not-really-redacted key I recently stumbled across in my travels.

The Private Lives of Private Keys Here is what a typical private key looks like, when you come across it:

Obviously, there s some hidden meaning in there computers don t encrypt things by shouting BEGIN RSA PRIVATE KEY! , after all. What is between the BEGIN/END lines above is, in fact, a base64-encoded DER format ASN.1 structure representing a PKCS#1 private key. In simple terms, it s a list of numbers very important numbers. The list of numbers is, in order:

A version number (0);
The public modulus , commonly referred to as n ;
The public exponent , or e (which is almost always 65,537, for various unimportant reasons);
The private exponent , or d ;
The two private primes , or p and q ;
Two exponents, which are known as dmp1 and dmq1 ; and
A coefficient, known as iqmp .

Why Is This a Problem? The thing is, only three of those numbers are actually required in a private key. The rest, whilst useful to allow the RSA encryption and decryption to be more efficient, aren t necessary. The three absolutely required values are e, p, and q. Of the other numbers, most of them are at least about the same size as each of p and q. So of the total data in an RSA key, less than a quarter of the data is required. Let me show you with the above toy key, by breaking it down piece by piece¹:

MGI DER for this is a sequence
CAQ version (0)
CxjdTmecltJEz2PLMpS4BX n
AgMBAA e
ECEDKtuwD17gpagnASq1zQTY d
ECCQDVTYVsjjF7IQ p
IJANUYZsIjRsR3 q
AgkAkahDUXL0RS dmp1
ECCB78r2SnsJC9 dmq1
AghaOK3FsKoELg== iqmp

Remember that in order to reconstruct all of these values, all I need are e, p, and q and e is pretty much always 65,537. So I could redact almost all of this key, and still give all the important, private bits of this key. Let me show you:

-----BEGIN RSA PRIVATE KEY-----
..............................................................EC
CQDVTYVsjjF7IQIJANUYZsIjRsR3....................................
........
-----END RSA PRIVATE KEY-----

Now, I doubt that anyone is going to redact a key precisely like this but then again, this isn t a typical RSA key. They usually look a lot more like this:

People typically redact keys by deleting whole lines, and usually replacing them with [...] and the like. But only about 345 of those 1588 characters (excluding the header and footer) are required to construct the entire key. You can redact about 4/5ths of that giant blob of stuff, and your private parts (or at least, those of your key) are still left uncomfortably exposed.

But Wait! There s More! Remember how I said that everything in the key other than `e`, `p`, and `q` could be derived from those three numbers? Let s talk about one of those numbers: `n`. This is known as the public modulus (because, along with `e`, it is also present in the public key). It is very easy to calculate: `n = p * q`. It is also very early in the key (the second number, in fact). Since `n = p * q`, it follows that `q = n / p`. Thus, as long as the key is intact up to `p`, you can derive `q` by simple division.

Real World Redaction At this point, I d like to introduce an acquaintance of mine: Mr. Johan Finn. He is the proud owner of the GitHub repo johanfinn/scripts. For a while, his repo contained a script that contained a poorly-redacted private key. He since deleted it, by making a new commit, but of course because git never really deletes anything, it s still available. Of course, Mr. Finn may delete the repo, or force-push a new history without that commit, so here is the redacted private key, with a bit of the surrounding shell script, for our illustrative pleasure:

#Add private key to .ssh folder
cd /home/johan/.ssh/
echo  "-----BEGIN RSA PRIVATE KEY-----
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
 
MIIJKgIBAAKCAgEAxEVih1JGb8gu/Fm4AZh+ZwJw/pjzzliWrg4mICFt1g7SmIE2
TCQMKABdwd11wOFKCPc/UzRH/fHuQcvWrpbOSdqev/zKff9iedKw/YygkMeIRaXB
fYELqvUAOJ8PPfDm70st9GJRhjGgo5+L3cJB2gfgeiDNHzaFvapRSU0oMGQX+kI9
ezsjDAn+0Pp+r3h/u1QpLSH4moRFGF4omNydI+3iTGB98/EzuNhRBHRNq4oBV5SG
Pq/A1bem2ninnoEaQ+OPESxYzDz3Jy9jV0W/6LvtJ844m+XX69H5fqq5dy55z6DW
sGKn78ULPVZPsYH5Y7C+CM6GAn4nYCpau0t52sqsY5epXdeYx4Dc+Wm0CjXrUDEe
Egl4loPKDxJkQqQ/MQiz6Le/UK9vEmnWn1TRXK3ekzNV4NgDfJANBQobOpwt8WVB
rbsC0ON7n680RQnl7PltK9P1AQW5vHsahkoixk/BhcwhkrkZGyDIl9g8Q/Euyoq3
eivKPLz7/rhDE7C1BzFy7v8AjC3w7i9QeHcWOZFAXo5hiDasIAkljDOsdfD4tP5/
wSO6E6pjL3kJ+RH2FCHd7ciQb+IcuXbku64ln8gab4p8jLa/mcMI+V3eWYnZ82Yu
axsa85hAe4wb60cp/rCJo7ihhDTTvGooqtTisOv2nSvCYpcW9qbL6cGjAXECAwEA
AQKCAgEAjz6wnWDP5Y9ts2FrqUZ5ooamnzpUXlpLhrbu3m5ncl4ZF5LfH+QDN0Kl
KvONmHsUhJynC/vROybSJBU4Fu4bms1DJY3C39h/L7g00qhLG7901pgWMpn3QQtU
4P49qpBii20MGhuTsmQQALtV4kB/vTgYfinoawpo67cdYmk8lqzGzzB/HKxZdNTq
s+zOfxRr7PWMo9LyVRuKLjGyYXZJ/coFaobWBi8Y96Rw5NZZRYQQXLIalC/Dhndm
AHckpstEtx2i8f6yxEUOgPvV/gD7Akn92RpqOGW0g/kYpXjGqZQy9PVHGy61sInY
HSkcOspIkJiS6WyJY9JcvJPM6ns4b84GE9qoUlWVF3RWJk1dqYCw5hz4U8LFyxsF
R6WhYiImvjxBLpab55rSqbGkzjI2z+ucDZyl1gqIv9U6qceVsgRyuqdfVN4deU22
LzO5IEDhnGdFqg9KQY7u8zm686Ejs64T1sh0y4GOmGsSg+P6nsqkdlXH8C+Cf03F
lqPFg8WQC7ojl/S8dPmkT5tcJh3BPwIWuvbtVjFOGQc8x0lb+NwK8h2Nsn6LNazS
0H90adh/IyYX4sBMokrpxAi+gMAWiyJHIHLeH2itNKtAQd3qQowbrWNswJSgJzsT
JuJ7uqRKAFkE6nCeAkuj/6KHHMPsfCAffVdyGaWqhoxmPOrnVgECggEBAOrCCwiC
XxwUgjOfOKx68siFJLfHf4vPo42LZOkAQq5aUmcWHbJVXmoxLYSczyAROopY0wd6
Dx8rqnpO7OtZsdJMeBSHbMVKoBZ77hiCQlrljcj12moFaEAButLCdZFsZW4zF/sx
kWIAaPH9vc4MvHHyvyNoB3yQRdevu57X7xGf9UxWuPil/jvdbt9toaraUT6rUBWU
GYPNKaLFsQzKsFWAzp5RGpASkhuiBJ0Qx3cfLyirjrKqTipe3o3gh/5RSHQ6VAhz
gdUG7WszNWk8FDCL6RTWzPOrbUyJo/wz1kblsL3vhV7ldEKFHeEjsDGroW2VUFlS
asAHNvM4/uYcOSECggEBANYH0427qZtLVuL97htXW9kCAT75xbMwgRskAH4nJDlZ
IggDErmzBhtrHgR+9X09iL47jr7dUcrVNPHzK/WXALFSKzXhkG/yAgmt3r14WgJ6
5y7010LlPFrzaNEyO/S4ISuBLt4cinjJsrFpoo0WI8jXeM5ddG6ncxdurKXMymY7
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::.::
:::::::::::::::::::::::::::.::::::::::::::::::::::::::::::::::::
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLlL
 
 
 
YYYYYYYYYYYYYYYYYYYYYyYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
gff0GJCOMZ65pMSy3A3cSAtjlKnb4fWzuHD5CFbusN4WhCT/tNxGNSpzvxd8GIDs
nY7exs9L230oCCpedVgcbayHCbkChEfoPzL1e1jXjgCwCTgt8GjeEFqc1gXNEaUn
O8AJ4VlR8fRszHm6yR0ZUBdY7UJddxQiYOzt0S1RLlECggEAbdcs4mZdqf3OjejJ
06oTPs9NRtAJVZlppSi7pmmAyaNpOuKWMoLPElDAQ3Q7VX26LlExLCZoPOVpdqDH
KbdmBEfTR4e11Pn9vYdu9/i6o10U4hpmf4TYKlqk10g1Sj21l8JATj/7Diey8scO
sAI1iftSg3aBSj8W7rxCxSezrENzuqw5D95a/he1cMUTB6XuravqZK5O4eR0vrxR
AvMzXk5OXrUEALUvt84u6m6XZZ0pq5XZxq74s8p/x1JvTwcpJ3jDKNEixlHfdHEZ
ZIu/xpcwD5gRfVGQamdcWvzGHZYLBFO1y5kAtL8kI9tW7WaouWVLmv99AyxdAaCB
Y5mBAQKCAQEAzU7AnorPzYndlOzkxRFtp6MGsvRBsvvqPLCyUFEXrHNV872O7tdO
GmsMZl+q+TJXw7O54FjJJvqSSS1sk68AGRirHop7VQce8U36BmI2ZX6j2SVAgIkI
9m3btCCt5rfiCatn2+Qg6HECmrCsHw6H0RbwaXS4RZUXD/k4X+sslBitOb7K+Y+N
Bacq6QxxjlIqQdKKPs4P2PNHEAey+kEJJGEQ7bTkNxCZ21kgi1Sc5L8U/IGy0BMC
PvJxssLdaWILyp3Ws8Q4RAoC5c0ZP0W2j+5NSbi3jsDFi0Y6/2GRdY1HAZX4twem
Q0NCedq1JNatP1gsb6bcnVHFDEGsj/35oQKCAQEAgmWMuSrojR/fjJzvke6Wvbox
FRnPk+6YRzuYhAP/YPxSRYyB5at++5Q1qr7QWn7NFozFIVFFT8CBU36ktWQ39MGm
cJ5SGyN9nAbbuWA6e+/u059R7QL+6f64xHRAGyLT3gOb1G0N6h7VqFT25q5Tq0rc
Lf/CvLKoudjv+sQ5GKBPT18+zxmwJ8YUWAsXUyrqoFWY/Tvo5yLxaC0W2gh3+Ppi
EDqe4RRJ3VKuKfZxHn5VLxgtBFN96Gy0+Htm5tiMKOZMYAkHiL+vrVZAX0hIEuRZ
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
-----END RSA PRIVATE KEY-----" >> id_rsa

Now, if you try to reconstruct this key by removing the obvious garbage lines (the ones that are all repeated characters, some of which aren t even valid base64 characters), it still isn t a key at least, openssl pkey doesn t want anything to do with it. The key is very much still in there, though, as we shall soon see. Using a gem I wrote and a quick bit of Ruby, we can extract a complete private key. The irb session looks something like this:

>> require "derparse"
>> b64 = <<EOF
MIIJKgIBAAKCAgEAxEVih1JGb8gu/Fm4AZh+ZwJw/pjzzliWrg4mICFt1g7SmIE2
TCQMKABdwd11wOFKCPc/UzRH/fHuQcvWrpbOSdqev/zKff9iedKw/YygkMeIRaXB
fYELqvUAOJ8PPfDm70st9GJRhjGgo5+L3cJB2gfgeiDNHzaFvapRSU0oMGQX+kI9
ezsjDAn+0Pp+r3h/u1QpLSH4moRFGF4omNydI+3iTGB98/EzuNhRBHRNq4oBV5SG
Pq/A1bem2ninnoEaQ+OPESxYzDz3Jy9jV0W/6LvtJ844m+XX69H5fqq5dy55z6DW
sGKn78ULPVZPsYH5Y7C+CM6GAn4nYCpau0t52sqsY5epXdeYx4Dc+Wm0CjXrUDEe
Egl4loPKDxJkQqQ/MQiz6Le/UK9vEmnWn1TRXK3ekzNV4NgDfJANBQobOpwt8WVB
rbsC0ON7n680RQnl7PltK9P1AQW5vHsahkoixk/BhcwhkrkZGyDIl9g8Q/Euyoq3
eivKPLz7/rhDE7C1BzFy7v8AjC3w7i9QeHcWOZFAXo5hiDasIAkljDOsdfD4tP5/
wSO6E6pjL3kJ+RH2FCHd7ciQb+IcuXbku64ln8gab4p8jLa/mcMI+V3eWYnZ82Yu
axsa85hAe4wb60cp/rCJo7ihhDTTvGooqtTisOv2nSvCYpcW9qbL6cGjAXECAwEA
AQKCAgEAjz6wnWDP5Y9ts2FrqUZ5ooamnzpUXlpLhrbu3m5ncl4ZF5LfH+QDN0Kl
KvONmHsUhJynC/vROybSJBU4Fu4bms1DJY3C39h/L7g00qhLG7901pgWMpn3QQtU
4P49qpBii20MGhuTsmQQALtV4kB/vTgYfinoawpo67cdYmk8lqzGzzB/HKxZdNTq
s+zOfxRr7PWMo9LyVRuKLjGyYXZJ/coFaobWBi8Y96Rw5NZZRYQQXLIalC/Dhndm
AHckpstEtx2i8f6yxEUOgPvV/gD7Akn92RpqOGW0g/kYpXjGqZQy9PVHGy61sInY
HSkcOspIkJiS6WyJY9JcvJPM6ns4b84GE9qoUlWVF3RWJk1dqYCw5hz4U8LFyxsF
R6WhYiImvjxBLpab55rSqbGkzjI2z+ucDZyl1gqIv9U6qceVsgRyuqdfVN4deU22
LzO5IEDhnGdFqg9KQY7u8zm686Ejs64T1sh0y4GOmGsSg+P6nsqkdlXH8C+Cf03F
lqPFg8WQC7ojl/S8dPmkT5tcJh3BPwIWuvbtVjFOGQc8x0lb+NwK8h2Nsn6LNazS
0H90adh/IyYX4sBMokrpxAi+gMAWiyJHIHLeH2itNKtAQd3qQowbrWNswJSgJzsT
JuJ7uqRKAFkE6nCeAkuj/6KHHMPsfCAffVdyGaWqhoxmPOrnVgECggEBAOrCCwiC
XxwUgjOfOKx68siFJLfHf4vPo42LZOkAQq5aUmcWHbJVXmoxLYSczyAROopY0wd6
Dx8rqnpO7OtZsdJMeBSHbMVKoBZ77hiCQlrljcj12moFaEAButLCdZFsZW4zF/sx
kWIAaPH9vc4MvHHyvyNoB3yQRdevu57X7xGf9UxWuPil/jvdbt9toaraUT6rUBWU
GYPNKaLFsQzKsFWAzp5RGpASkhuiBJ0Qx3cfLyirjrKqTipe3o3gh/5RSHQ6VAhz
gdUG7WszNWk8FDCL6RTWzPOrbUyJo/wz1kblsL3vhV7ldEKFHeEjsDGroW2VUFlS
asAHNvM4/uYcOSECggEBANYH0427qZtLVuL97htXW9kCAT75xbMwgRskAH4nJDlZ
IggDErmzBhtrHgR+9X09iL47jr7dUcrVNPHzK/WXALFSKzXhkG/yAgmt3r14WgJ6
5y7010LlPFrzaNEyO/S4ISuBLt4cinjJsrFpoo0WI8jXeM5ddG6ncxdurKXMymY7
EOF
>> b64 += <<EOF
gff0GJCOMZ65pMSy3A3cSAtjlKnb4fWzuHD5CFbusN4WhCT/tNxGNSpzvxd8GIDs
nY7exs9L230oCCpedVgcbayHCbkChEfoPzL1e1jXjgCwCTgt8GjeEFqc1gXNEaUn
O8AJ4VlR8fRszHm6yR0ZUBdY7UJddxQiYOzt0S1RLlECggEAbdcs4mZdqf3OjejJ
06oTPs9NRtAJVZlppSi7pmmAyaNpOuKWMoLPElDAQ3Q7VX26LlExLCZoPOVpdqDH
KbdmBEfTR4e11Pn9vYdu9/i6o10U4hpmf4TYKlqk10g1Sj21l8JATj/7Diey8scO
sAI1iftSg3aBSj8W7rxCxSezrENzuqw5D95a/he1cMUTB6XuravqZK5O4eR0vrxR
AvMzXk5OXrUEALUvt84u6m6XZZ0pq5XZxq74s8p/x1JvTwcpJ3jDKNEixlHfdHEZ
ZIu/xpcwD5gRfVGQamdcWvzGHZYLBFO1y5kAtL8kI9tW7WaouWVLmv99AyxdAaCB
Y5mBAQKCAQEAzU7AnorPzYndlOzkxRFtp6MGsvRBsvvqPLCyUFEXrHNV872O7tdO
GmsMZl+q+TJXw7O54FjJJvqSSS1sk68AGRirHop7VQce8U36BmI2ZX6j2SVAgIkI
9m3btCCt5rfiCatn2+Qg6HECmrCsHw6H0RbwaXS4RZUXD/k4X+sslBitOb7K+Y+N
Bacq6QxxjlIqQdKKPs4P2PNHEAey+kEJJGEQ7bTkNxCZ21kgi1Sc5L8U/IGy0BMC
PvJxssLdaWILyp3Ws8Q4RAoC5c0ZP0W2j+5NSbi3jsDFi0Y6/2GRdY1HAZX4twem
Q0NCedq1JNatP1gsb6bcnVHFDEGsj/35oQKCAQEAgmWMuSrojR/fjJzvke6Wvbox
FRnPk+6YRzuYhAP/YPxSRYyB5at++5Q1qr7QWn7NFozFIVFFT8CBU36ktWQ39MGm
cJ5SGyN9nAbbuWA6e+/u059R7QL+6f64xHRAGyLT3gOb1G0N6h7VqFT25q5Tq0rc
Lf/CvLKoudjv+sQ5GKBPT18+zxmwJ8YUWAsXUyrqoFWY/Tvo5yLxaC0W2gh3+Ppi
EDqe4RRJ3VKuKfZxHn5VLxgtBFN96Gy0+Htm5tiMKOZMYAkHiL+vrVZAX0hIEuRZ
EOF
>> der = b64.unpack("m").first
>> c = DerParse.new(der).first_node.first_child
>> version = c.value
=> 0
>> c = c.next_node
>> n = c.value
=> 80071596234464993385068908004931... # (etc)
>> c = c.next_node
>> e = c.value
=> 65537
>> c = c.next_node
>> d = c.value
=> 58438813486895877116761996105770... # (etc)
>> c = c.next_node
>> p = c.value
=> 29635449580247160226960937109864... # (etc)
>> c = c.next_node
>> q = c.value
=> 27018856595256414771163410576410... # (etc)

What I ve done, in case you don t speak Ruby, is take the two chunks of plausible-looking base64 data, chuck them together into a variable named b64, unbase64 it into a variable named der, pass that into a new DerParse instance, and then walk the DER value tree until I got all the values I need. Interestingly, the q value actually traverses the split in the two chunks, which means that there s always the possibility that there are lines missing from the key. However, since p and q are supposed to be prime, we can sanity check them to see if corruption is likely to have occurred:

>> require "openssl"
>> OpenSSL::BN.new(p).prime?
=> true
>> OpenSSL::BN.new(q).prime?
=> true

Excellent! The chances of a corrupted file producing valid-but-incorrect prime numbers isn t huge, so we can be fairly confident that we ve got the real p and q. Now, with the help of another one of my creations we can use e, p, and q to create a fully-operational battle key:

>> require "openssl/pkey/rsa"
>> k = OpenSSL::PKey::RSA.from_factors(p, q, e)
=> #<OpenSSL::PKey::RSA:0x0000559d5903cd38>
>> k.valid?
=> true
>> k.verify(OpenSSL::Digest::SHA256.new, k.sign(OpenSSL::Digest::SHA256.new, "bob"), "bob")
=> true

and there you have it. One fairly redacted-looking private key brought back to life by maths and far too much free time. Sorry Mr. Finn, I hope you re not still using that key on anything Internet-facing.

What About Other Key Types? EC keys are very different beasts, but they have much the same problems as RSA keys. A typical EC key contains both private and public data, and the public portion is twice the size so only about 1/3 of the data in the key is private material. It is quite plausible that you can redact an EC key and leave all the actually private bits exposed.

What Do We Do About It? In short: don t ever try and redact real private keys. For documentation purposes, just put KEY GOES HERE in the appropriate spot, or something like that. Store your secrets somewhere that isn t a public (or even private!) git repo. Generating a dummy private key and sticking it in there isn t a great idea, for different reasons: people have this odd habit of reusing demo keys in real life. There s no need to encourage that sort of thing.

Technically the pieces aren t 100% aligned with the underlying DER, because of how base64 works. I felt it was easier to understand if I stuck to chopping up the base64, rather than decoding into DER and then chopping up the DER.

17 March 2020

Norbert Preining: Brave broken (on Debian?)

Update a few hours later: an update came in, version 1.7.62, which fixed this misbehavior!!!! I have been using the Brave browser now for many months without any problem, in particular I use the -dev version, which is rather on the edge of development. Unfortunately, either due to Linux kernel changes (I am using the latest stable kernel, at the moment 5.5.9) or due to Brave changes, the browser has become completely unusable: CPU spikes consistently to 100% and more, and the dev browser tries to upload crash reports every 5sec. The problem is shown in the kernel log as

traps: brave[36513] trap invalid opcode ip:55d4e0cb1895 sp:7ffc4b08ff00 error:0 in brave[55d4dfe01000+78e5000]

I have checked older versions, as well as beta versions, all of them exhibit the same problem. I have disabled all extensions, without any change in behavior. I also have reported this problem in the Brave community forum, without any answer. This means, for now I have to switch back to Firefox, which works smoothly.

6 March 2020

Reproducible Builds: Reproducible Builds in February 2020

Welcome to the February 2020 report from the Reproducible Builds project. One of the original promises of open source software is that distributed peer review and transparency of process results in enhanced end-user security. However, whilst anyone may inspect the source code of free and open source software for malicious flaws, almost all software today is distributed as pre-compiled binaries. This allows nefarious third-parties to compromise systems by injecting malicious code into ostensibly secure software during the various compilation and distribution processes. The motivation behind the reproducible builds effort is to provide the ability to demonstrate these binaries originated from a particular, trusted, source release: if identical results are generated from a given source in all circumstances, reproducible builds provides the means for multiple third-parties to reach a consensus on whether a build was compromised via distributed checksum validation or some other scheme. In this month s report, we cover:

Media coverage & upstream news A new paper on reproducible containers, Ruby updates, etc.
Distribution work More work in Debian, openSUSE & friends.
Software development Updates and improvements to our tooling.
Getting in touch How to contribute, more venues for discussion.

If you are interested in contributing to the project, please visit our Contribute page on our website.

Media coverage & upstream news

Omar Navarro Leija, a PhD student at the University Of Pennsylvania, published a paper entitled Reproducible Containers that describes in detail the workings of a new user-space container tool called DetTrace:

All computation that occurs inside a DetTrace container is a pure function of the initial filesystem state of the container. Reproducible containers can be used for a variety of purposes, including replication for fault-tolerance, reproducible software builds and reproducible data analytics. We use DetTrace to achieve, in an automatic fashion, reproducibility for 12,130 Debian package builds, containing over 800 million lines of code, as well as bioinformatics and machine learning workflows.

There was also considerable discussion on our mailing list regarding this research and a presentation based on the paper will occur at the ASPLOS 2020 conference between March 16th 20th in Lausanne, Switzerland.

The many virtues of Reproducible Builds were touted as benefits for software compliance in a talk at FOSDEM 2020, debating whether the Careful Inventory of Licensing Bill of Materials Have Impact of FOSS License Compliance which pitted Jeff McAffer and Carol Smith against Bradley Kuhn and Max Sills. (~47 minutes in). Nobuyoshi Nakada updated the canonical implementation of the Ruby programming language a change such that filesystem globs (ie. calls to list the contents of filesystem directories) will henceforth be sorted in ascending order. Without this change, the underlying nondeterministic ordering of the filesystem is exposed to the language which often results in an unreproducible build. Vagrant Cascadian reported on our mailing list regarding a quick reproducible test for the GNU Guix distribution, which resulted in 81.9% of packages registering as reproducible in his installation:

$ guix challenge --verbose --diff=diffoscope ...
2,463 store items were analyzed:
  - 2,016 (81.9%) were identical
  - 37 (1.5%) differed
  - 410 (16.6%) were inconclusive

Jeremiah Orians announced on our mailing list the release of a number of tools related to cross-compilation such as M2-Planet and mescc-tools-seed. This project attemps a full bootstrap of a cross-platform compiler for the C programming language (written in C itself) from hex, the ultimate goal being able to demonstrate fully-bootstrapped compiler from hex to the GCC GNU Compiler Collection. This has many implications in and around Ken Thompson s Trusting Trust attack outlined in Thompson s 1983 Turing Award Lecture. Twitter user @TheYoctoJester posted an executive summary of reproducible builds in the Yocto Project:

Finally, Reddit user tofflos posted to the /r/Java subreddit asking about how to achieve reproducible builds with Maven and Chris Lamb noticed that the Linux kernel documentation about reproducible builds of it is available on the kernel.org homepages in an attractive HTML format.

Distribution work

Debian Chris Lamb created a merge request for the core `debian-installer` package to allow all arguments and options from `sources.list` files (such as `[check-valid-until=no]` , etc.) in order that we can test the reproducibility of the installer images on the Reproducible Builds own testing infrastructure. (#13) Thorsten Glaser followed-up to a bug filed against the `dpkg-source` component that was originally filed in late 2015 that claims that the build tool does not respect permissions when unpacking tarballs if the umask is set to `0002`. Matthew Garrett posted to the `debian-devel` mailing list on the topic of Producing verifiable initramfs images as part of a wider conversation on being able to trust the entire software stack on our computers. 59 reviews of Debian packages were added, 30 were updated and 42 were removed this month adding to our knowledge about identified issues. Many issue types were noticed and categorised by Chris Lamb, including:

`openstack_pkg_tools_python_shebang_and_dependencies`

`build_path_in_code_generated_by_dgbus_codegen`

`random_argument_handling_in_javatools_jh_build`

`fortran_captures_build_path`

etc.

openSUSE In openSUSE, Bernhard M. Wiedemann published his monthly Reproducible Builds status update as well as provided the following patches:

`python-rpm-macros` (do not save time-based `.pyc` files for tests)

`solfege` (filesystem ordering issue sent upstream via email; package is orphaned upstream)

`DVDStyler` (zip timestamps, submitted upstream)

`python-pikepdf` (recreate unreproducible `.pyc` files)

Software development

diffoscope diffoscope is our in-depth and content-aware diff-like utility that can locate and diagnose reproducibility issues. It is run countless times a day on our testing infrastructure and is essential for identifying fixes and causes of nondeterministic behaviour. Chris Lamb made the following changes this month, including uploading version `137` to Debian:

The `sng` image utility appears to return with an exit code of 1 if there are even minor errors in the file. (#950806)

Also extract `classes2.dex`, `classes3.dex` from `.apk` files extracted by `apktool`. (#88)

No need to use `str.format` if we are just returning the string. [ ]

Add generalised support for ignoring returncodes [ ] and move special-casing of returncodes in zip to use `Command.VALID_RETURNCODES`. [ ]

Other tools disorderfs is our FUSE-based filesystem that deliberately introduces non-determinism into directory system calls in order to flush out reproducibility issues. This month, Vagrant Cascadian updated the `Vcs-Git` to specify the `debian` packaging branch. [ ] reprotest is our end-user tool to build same source code twice in widely differing environments and then checks the binaries produced by each build for any differences. This month, versions `0.7.13` and `0.7.14` were uploaded to Debian unstable by Holger Levsen after Vagrant Cascadian added support for GNU Guix [ ].

Project documentation & website There was more work performed on our documentation and website this month. Bernhard M. Wiedemann added a Java Gradle Build Tool snippet to the `SOURCE_DATE_EPOCH` documentation [ ] and normalised various terms to unreproducible [ ]. Chris Lamb added a Meson.build example [ ] and improved the documentation for the CMake [ ] to the `SOURCE_DATE_EPOCH` documentation, replaced anyone can with anyone may as, well, not everyone has the resources, skills, time or funding to actually do what it refers to [ ] and improved the pre-processing for our report generation [ ][ ][ ][ ] etc. In addition, Holger Levsen updated our news page to improve the list of reports [ ], added an explicit mention of the weekly news time span [ ] and reverted sorting of news entries to have latest on top [ ] and Mattia Rizzolo added Codethink as a non-fiscal sponsor [ ] and lastly Tianon Gravi added a Docker Images link underneath the Debian project on our Projects page [ ].

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including:

Bernhard M. Wiedemann (for the openSUSE distribution):

`rdiff-backup` fixed, build date in Python manpage

`click-man` fixed, toolchain, build date in Python manpage

`rpm` report filesystem-order related nondeterminism

`python-enaml` report nondeterministic `.enamlc` compiler output

`k9s` date

`python-xcffib` fix a test with a race condition

Chris Lamb:

#943956 re-opened against `snakemake`.

#950630 filed against `designate`.

#950936 filed against `pynwb`.

#950942 filed against `python-oslo.reports`.

#951357 filed against `mate-desktop` (forwarded upstream).

#951573 filed against `javatools`.

#952493 filed against `xavs2`.

#952694 filed against `snapd-glib` (forwarded upstream).

#952762 filed against `openstack-pkg-tools`.

ccextractor issues with MacOSX BSD date not understanding GNU date options

`python-django` (Always build the documentation in English)

`python_example` (sort `glob` / `readdir`)

Vagrant Cascadian submitted patches via the Debian bug tracking system targeting the packages the Civil Infrastructure Platform has identified via the CIP and CIP build depends package sets:

#950410 filed against `autogen`.

#950417 & 950416 & 950415 filed against `autoconf`.

#950419 filed against `m4`.

#950444 filed against `intltool`.

#950585 filed against `binutils-dev`.

#950603 filed against `yodl-doc`.

#950606 filed against `gdb-source`.

#950704 filed against `automake-1.15`.

#951031 filed against `libtool`.

Testing framework We operate a fully-featured and comprehensive Jenkins-based testing framework that powers tests.reproducible-builds.org. This month, the following changes were made by Holger Levsen:

Debian:

Configured an instance of David Bremner s `builtin-pho` database of `.buildinfo` files. This has resulted in so that we now know that Debian bullseye contains 4,557 source packages for the `amd64` architecture without corresponding `.buildinfo` files and 25,668 source packages with `.buildinfo` files. [ ][ ][ ][ ][ ]

Forward mails addressed to `buildinfo@` to the `root` user. [ ]

Temporarily revert using backports [ ][ ] and use `devscripts` from buster-backports [ ].

Update URL for PureOS package set. [ ]

Deprioritise the scheduling of older old packages in bullseye and unstable. [ ][ ]

Treat the bullseye distribution like unstable when scheduling the `i386` architecture, to allow the latter to catch up a bit. [ ]

Misc/other:

Disable the last active Alpine builder too for now. [ ]

Improve generated HTML output. [ ][ ]

Ignore Fedora jobs. [ ]

Include links to failed and unstable jobs in HTML output. [ ]

Start weighting job importance and five a lot more weight to nodes acting as a proxy for others. [ ][ ][ ]

Add new job to calculate the overall system health for tests.reproducible-builds.org for usage with Jelle van der Waa s Reproducible Builds status display. [ ]

In addition, Mattia Rizzolo added an Apache web server redirect for buildinfos.debian.net [ ] and reverted the reshuffling of `arm64` architecture builders [ ]. The usual build node maintenance was performed by Holger Levsen, Mattia Rizzolo [ ][ ] and Vagrant Cascadian.

Getting in touch If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. However, you can get in touch with us via:

IRC: `#reproducible-builds` on `irc.oftc.net`.

Twitter: @ReproBuilds

Reddit: /r/ReproducibleBuilds

Mailing list: `rb-general@lists.reproducible-builds.org`

This month s report was written by Bernhard M. Wiedemann, Chris Lamb and Holger Levsen. It was subsequently reviewed by a bunch of Reproducible Builds folks on IRC and the mailing list.

16 November 2017

Colin Watson: Kitten Block equivalent for Firefox 57

I ve been using Kitten Block for years, since I don t really need the blood pressure spike caused by accidentally following links to certain UK newspapers. Unfortunately it hasn t been ported to Firefox 57. I tried emailing the author a couple of months ago, but my email bounced. However, if your primary goal is just to block the websites in question rather than seeing kitten pictures as such (let s face it, the internet is not short of alternative sources of kitten pictures), then it s easy to do with uBlock Origin. After installing the extension if necessary, go to Tools Add-ons Extensions uBlock Origin Preferences My filters, and add www.dailymail.co.uk and www.express.co.uk, each on its own line. (Of course you can easily add more if you like.) Voil : instant tranquility. Incidentally, this also works fine on Android. The fact that it was easy to install a good ad blocker without having to mess about with a rooted device or strange proxy settings was the main reason I switched to Firefox on my phone.

23 August 2017

Antoine Beaupr : The supposed decline of copyleft

At DebConf17, John Sullivan, the executive director of the FSF, gave a talk on the supposed decline of the use of copyleft licenses use free-software projects. In his presentation, Sullivan questioned the notion that permissive licenses, like the BSD or MIT licenses, are gaining ground at the expense of the traditionally dominant copyleft licenses from the FSF. While there does seem to be a rise in the use of permissive licenses, in general, there are several possible explanations for the phenomenon.

When the rumor mill starts Sullivan gave a recent example of the claim of the decline of copyleft in an article on Opensource.com by Jono Bacon from February 2017 that showed a histogram of license usage between 2010 and 2017 (seen below).

From that, Bacon elaborates possible reasons for the apparent decline of the GPL. The graphic used in the article was actually generated by Stephen O'Grady in a January article, The State Of Open Source Licensing, which said:
In Black Duck's sample, the most popular variant of the GPL version 2 is less than half as popular as it was (46% to 19%). Over the same span, the permissive MIT has gone from 8% share to 29%, while its permissive cousin the Apache License 2.0 jumped from 5% to 15%.
Sullivan, however, argued that the methodology used to create both articles was problematic. Neither contains original research: the graphs actually come from the Black Duck Software "KnowledgeBase" data, which was partly created from the old Ohloh web site now known as Open Hub. To show one problem with the data, Sullivan mentioned two free-software projects, GNU Bash and GNU Emacs, that had been showcased on the front page of Ohloh.net in 2012. On the site, Bash was (and still is) listed as GPLv2+, whereas it changed to GPLv3 in 2011. He also claimed that "Emacs was listed as licensed under GPLv3-only, which is a license Emacs has never had in its history", although I wasn't able to verify that information from the Internet archive. Basically, according to Sullivan, "the two projects featured on the front page of a site that was using [the Black Duck] data set were wrong". This, in turn, seriously brings into question the quality of the data:
I reported this problem and we'll continue to do that but when someone is not sharing the data set that they're using for other people to evaluate it and we see glimpses of it which are incorrect, that should give us a lot of hesitation about accepting any conclusion that comes out of it.
Reproducible observations are necessary to the establishment of solid theories in science. Sullivan didn't try to contact Black Duck to get access to the database, because he assumed (rightly, as it turned out) that he would need to "pay for the data under terms that forbid you to share that information with anybody else". So I wrote Black Duck myself to confirm this information. In an email interview, Patrick Carey from Black Duck confirmed its data set is proprietary. He believes, however, that through a "combination of human and automated techniques", Black Duck is "highly confident at the accuracy and completeness of the data in the KnowledgeBase". He did point out, however, that "the way we track the data may not necessarily be optimal for answering the question on license use trend" as "that would entail examination of new open source projects coming into existence each year and the licenses used by them". In other words, even according to Black Duck, its database may not be useful to establish the conclusions drawn by those articles. Carey did agree with those conclusions intuitively, however, saying that "there seems to be a shift toward Apache and MIT licenses in new projects, though I don't have data to back that up". He suggested that "an effective way to answer the trend question would be to analyze the new projects on GitHub over the last 5-10 years." Carey also suggested that "GitHub has become so dominant over the recent years that just looking at projects on GitHub would give you a reasonable sampling from which to draw conclusions".

Indeed, GitHub published a report in 2015 that also seems to confirm MIT's popularity (45%), surpassing copyleft licenses (24%). The data is, however, not without its own limitations. For example, in the above graph going back to the inception of GitHub in 2008, we see a rather abnormal spike in 2013, which seems to correlate with the launch of the choosealicense.com site, described by GitHub as "our first pass at making open source licensing on GitHub easier". In his talk, Sullivan was critical of the initial version of the site which he described as biased toward permissive licenses. Because the GitHub project creation page links to the site, Sullivan explained that the site's bias could have actually influenced GitHub users' license choices. Following a talk from Sullivan at FOSDEM 2016, GitHub addressed the problem later that year by rewording parts of the front page to be more accurate, but that any change in license choice obviously doesn't show in the report produced in 2015 and won't affect choices users have already made. Therefore, there can be reasonable doubts that GitHub's subset of software projects may not actually be that representative of the larger free-software community.

In search of solid evidence So it seems we are missing good, reproducible results to confirm or dispel these claims. Sullivan explained that it is a difficult problem, if only in the way you select which projects to analyze: the impact of a MIT-licensed personal wiki will obviously be vastly different from, say, a GPL-licensed C compiler or kernel. We may want to distinguish between active and inactive projects. Then there is the problem of code duplication, both across publication platforms (a project may be published on GitHub and SourceForge for example) but also across projects (code may be copy-pasted between projects). We should think about how to evaluate the license of a given project: different files in the same code base regularly have different licenses often none at all. This is why having a clear, documented and publicly available data set and methodology is critical. Without this, the assumptions made are not clear and it is unreasonable to draw certain conclusions from the results. It turns out that some researchers did that kind of open research in 2016 in a paper called "The Debsources Dataset: Two Decades of Free and Open Source Software" [PDF] by Matthieu Caneill, Daniel M. Germ n, and Stefano Zacchiroli. The Debsources data set is the complete Debian source code that covers a large history of the Debian project and therefore includes thousands of free-software projects of different origins. According to the paper:
The long history of Debian creates a perfect subject to evaluate how FOSS licenses use has evolved over time, and the popularity of licenses currently in use.
Sullivan argued that the Debsources data set is interesting because of its quality: every package in Debian has been reviewed by multiple humans, including the original packager, but also by the FTP masters to ensure that the distribution can legally redistribute the software. The existence of a package in Debian provides a minimal "proof of use": unmaintained packages get removed from Debian on a regular basis and the mere fact that a piece of software gets packaged in Debian means at least some users found it important enough to work on packaging it. Debian packagers make specific efforts to avoid code duplication between packages in order to ease security maintenance. The data set covers a period longer than Black Duck's or GitHub's, as it goes all the way back to the Hamm 2.0 release in 1998. The data and how to reproduce it are freely available under a CC BY-SA 4.0 license.

Sullivan presented the above graph from the research paper that showed the evolution of software license use in the Debian archive. Whereas previous graphs showed statistics in percentages, this one showed actual absolute numbers, where we can't actually distinguish a decline in copyleft licenses. To quote the paper again:
The top license is, once again, GPL-2.0+, followed by: Artistic-1.0/GPL dual-licensing (the licensing choice of Perl and most Perl libraries), GPL-3.0+, and Apache-2.0.
Indeed, looking at the graph, at most do we see a rise of the Apache and MIT licenses and no decline of the GPL per se, although its adoption does seem to slow down in recent years. We should also mention the possibility that Debian's data set has the opposite bias: toward GPL software. The Debian project is culturally quite different from the GitHub community and even the larger free-software ecosystem, naturally, which could explain the disparity in the results. We can only hope a similar analysis can be performed on the much larger Software Heritage data set eventually, which may give more representative results. The paper acknowledges this problem:
Debian is likely representative of enterprise use of FOSS as a base operating system, where stable, long-term and seldomly updated software products are desirable. Conversely Debian is unlikely representative of more dynamic FOSS environments (e.g., modern Web-development with micro libraries) where users, who are usually developers themselves, expect to receive library updates on a daily basis.
The Debsources research also shares methodology limitations with Black Duck: while Debian packages are reviewed before uploading and we can rely on the copyright information provided by Debian maintainers, the research also relies on automated tools (specifically FOSSology) to retrieve license information. Sullivan also warned against "ascribing reason to numbers": people may have different reasons for choosing a particular license. Developers may choose the MIT license because it has fewer words, for compatibility reasons, or simply because "their lawyers told them to". It may not imply an actual deliberate philosophical or ideological choice. Finally, he brought up the theory that the rise of non-copyleft licenses isn't necessarily at the detriment of the GPL. He explained that, even if there is an actual decline, it may not be much of a problem if there is an overall growth of free software to the detriment of proprietary software. He reminded the audience that non-copyleft licenses are still free software, according to the FSF and the Debian Free Software Guidelines, so their rise is still a positive outcome. Even if the GPL is a better tool to accomplish the goal of a free-software world, we can all acknowledge that the conversion of proprietary software to more permissive and certainly simpler licenses is definitely heading in the right direction.
[I would like to thank the DebConf organizers for providing meals for me during the conference.] Note: this article first appeared in the Linux Weekly News.

1 April 2017

Mike Hommey: Progress on git-cinnabar memory usage

This all started when I figured out that git-cinnabar was using crazy amounts of memory when cloning mozilla-central. That pointed to memory allocation patterns that triggered a suboptimal behavior in the glibc memory allocator, and, while overall, git-cinnabar wasn t really abusing memory all things considered, it happened to be realloc()ating way too much. It also turned out that recent changes on the master branch had made most uses of fast-import synchronous, making the whole process significantly slower. This is where we started from on 0.4.0:

And on the master branch as of be75326:

An interesting thing to note here is that the glibc allocator runaway memory use was, this time, more pronounced on 0.4.0 than on master. It was the opposite originally, but as I mentioned in the past ASLR is making it not happen exactly the same way each time. While I m here, one thing I failed to mention in the previous posts is that all these measurements were done by cloning a local mercurial clone of mozilla-central, served from localhost via HTTP to eliminate the download time from hg.mozilla.org. And while mozilla-central itself has received new changesets since the first post, the local clone has not been updated, such that all subsequent clone tests I did were cloning the exact same repository under the exact same circumstances. After last blog post, I focused on the low hanging fruits identified so far:

Moving the mercurial to git SHA1 mapping to the helper process (Finding a git bug in the process).
Tracking mercurial manifest heads in the helper process.
Removing most of the synchronous calls to the helper happening during a clone.

And this is how things now look on the master branch as of 35c18e7:

So where does that put us?

The overall clone is now about 11 minutes faster than 0.4.0 (and about 50 minutes faster than master as of be75326!)
Non-shared memory use of the git-remote-hg process stays well under 2GB during the whole clone, with no spike at the end.
git-cinnabar-helper now uses more memory, but the sum of both processes is less than what it used to be, even when compensating for the glibc memory allocator issue. One thing to note is that while the git-cinnabar-helper memory use goes above 2GB at the end of the clone, a very large part is due to the pack window size being 1GB on 64-bits (vs. 32MB on 32-bits). Memory usage should stay well under the 2GB address space limit on a 32-bits system.
CPU usage is well above 100% for most of the clone.

On a more granular level:

The Import manifests phase is now 13 minutes faster than it was in 0.4.0.
The Read and import files phase is still almost 4 minutes slower than in 0.4.0.
The Import changesets phase is still almost 2 minutes slower than in 0.4.0.
But the Finalization phase is now 3 minutes faster than in 0.4.0.

What this means is that there s still room for improvement. But at this point, I d rather focus on other things. Logging all the memory allocations with the python allocator disabled still resulted in a 6.5GB compressed log file, containing 2.6 billion calls to malloc, calloc, free and realloc (down from 2.7 billions in be75326). The number of allocator calls done by the git-remote-hg process is down to 2.25 billions (from 2.34 billion in be75326). Surprisingly, while more things were moved to the helper, it still made less allocations than in be75326: 345 millions, down from 363 millions. Presumably, this is because the number of commands processed by the fast-import code was reduced. Let s now take a look at the various metrics we analyzed previously (the horizontal axis represents the number of allocator calls that happened before the measurement):

A few observations to make here:

The allocated memory (requested bytes) is well below what it was, and the spike at the end is entirely gone. It also more closely follows the amount of raw data we re holding on to (which makes sense since most of the bookkeeping was moved to the helper)
The number of live allocations (allocated memory pointers that haven t been free()d yet) has gone significantly down as well.
The cumulated[*] bytes are now in a much more reasonable range, with the lower bound close to the total amount of data we re dealing with during the clone, and the upper bound slightly over twice that amount (the upper bound for the be75326 is not shown here, but it was around 45TB; less than 7TB is a big improvement).
There are less allocator calls during the first phases and the Importing changesets phase, but more during the Reading and importing files and Importing manifests phases.

[*] The upper bound is the sum of all sizes ever given to malloc, calloc, realloc etc. and the lower bound is the same, but removing the size of allocations passed as input to realloc (in practical words, this pretends reallocs never happened and that the final size for a given reallocated pointer is the one that counts) So presumably, some of the changes led to more short-lived allocations. Considering python uses its own allocator for sizes smaller than 512 bytes, it s probably not so much of a problem. But let s look at the distribution of buffer sizes (including all sizes given to realloc).

(Bucket size is 16 bytes) What is not completely obvious from the logarithmic scale is that, in fact, 98.4% of the allocations are less than 512 bytes with the current master (35c18e7), and they were 95.5% with be75326. Interestingly, though, in absolute numbers, there are less allocations smaller than 512 bytes in current master than in be75326 (1,194,268,071 vs 1,214,784,494). This suggests the extra allocations that happen during some phases are larger than that. There are clearly less allocations across the board (apart from very few exceptions), and close to an order of magnitude less allocations larger than 1MiB. In fact, widening the bucket size to 32KiB shows an order of magnitude difference (or close) for most buckets:

An interesting thing to note is how some sizes are largely overrepresented in the data with buckets of 16 bytes, like 768, 1104, 2048, 4128, with other smaller bumps for e.g. 2144, 2464, 2832, 3232, 3696, 4208, 4786, 5424, 6144, 6992, 7920 While some of those are powers of 2, most aren t, and some of them may actually represent objects sized with a power of 2, but that have an extra PyObject overhead. While looking at allocation stats, I got to wonder what the lifetimes of those allocations looked like. So I scanned the allocator logs and measured the distance between when an allocation is made and when it is freed, ignoring reallocs. To give a few examples of what I mean, the following allocation for p gets a lifetime of 0:

void *p = malloc(42);
free(p);

The following a lifetime of 1:

void *p = malloc(42);
void *other = malloc(42);
free(p);

And the following a lifetime of 1 as well:

void *p = malloc(42);
p = realloc(p, 84);
free(p);

(that is, it is not counted as two malloc/free pairs) The further away the free is from the corresponding malloc, the larger the lifetime. And the largest the lifetime can ever be is the total number of allocator function calls minus two, in the hypothetical case the very first allocation is freed as the very last (minus two because we defined the lifetime as the distance).

What comes out of this data:

As expected, there are more short-lived allocations in 35c18e7.
Around 90% of allocations have a lifetime spanning 10% of the process life or less. This is a rather surprisingly large amount of allocations with a very large lifetime.
Around 80% of allocations have a lifetime spanning 0.01% of the process life or less.
The median lifetime is around 0.0000002% (2*10^-7%) of the process life, which, in absolute terms is around 500 allocator function calls between a malloc and a free.
If we consider every imported changeset, manifest and file to require a similar number of allocations, and considering there are about 2.7M of them in total, each spans about 3.7*10^-7%. About 53% of all allocations on be75326 and 57% on 35c18e7 have a lifetime below that. Whenever I get to look more closely to memory usage again, I ll probably look at the data separately for each individual phase.
One surprising fact, that doesn t appear on the graph because of the logarithmic scale not showing 0 on the horizontal axis, is that 9.3% on be75326 and 7.3% on 35c18e7 of all allocations have a lifetime of 0. That is, whatever the code using them is doing, it s not allocating or freeing anything else, and not reallocating them either.

All in all, what the data shows is that we re definitely in a better place now than we used to be a few days ago, and that there is still work to do on the memory front, but:

As mentioned in a previous post, there are bigger wins to be had from not keeping manifests data around in memory at all, and by importing it directly instead.
In time, a lot of the import code is meant to move to the helper, where the constraints are completely different, and it might not be worth spending time now on reducing the memory usage of python code that might go away soon(ish). The situation was bad and necessitated action rather quickly, but we re now in a place where it s not as bad anymore.

So at this point, I won t look any deeper into the memory usage of the git-remote-hg python process, and will instead focus on the planned metadata storage changes. They will allow to share the metadata more easily (allowing faster and more straightforward gecko-dev graft), and will allow to import manifests earlier, which, as mentioned already, will help reduce memory use, but, more importantly, will allow to do more actual work while downloading the data. On slow networks, this is crucial to make clones and pulls faster.

28 March 2017

Axel Beckert: Maintaining Debian Packages of Perl Modules with dh-dist-zilla

Maintaining Debian packages of Perl modules usually can be done with the common git-buildpackage (aka gbp) workflow with its three git branches master (or debian), upstream and pristine-tar:

upstream contains the upstream code as imported from upstream s release tar-balls.
pristine-tar contains the binary diffs between the contents of the upstream branch and the original tar-ball. This mostly contains meta-data (timestamps, permissions, file owners, etc.) as git doesn t store them.
master (or debian) which contains upstream plus packaging.

This also works more or less fine for Perl modules, where the Debian package maintainer is also the upstream developer. In that case mostly the upstream branch is used (and then maybe called master while the Debian packaging branch is then called debian). But the files needed for a proper so called CPAN distribution of a Perl module often contain redundant information (version numbers, required modules, etc.) which needs to be maintained. And for that, many people prefer Don t Repeat Yourself (DRY) as a principle. Dist::Zilla One nice and common tool for that is Dist::Zilla or short dzil. It generates most redundant but required data out of a central source, e.g. Dist::Zilla s dist.ini or the contained .pm files, etc. dzil build creates tar ball which contains all files necessary by CPAN. But now we have a dilemma: Debian expects those generated files inside the upstream branch while the files are only generated from other files in that branch. There are multiple solutions, but all of them involve committing generated files to the git repository:

Commit them into the upstream branch. Disadvantage: You ll likely later forget which files were generated and which weren t.
Commit the generated files into a separated branch, e.g. use master (original code), upstream (original code + stuff generated by dzil build, maybe imported with git-import-orig), pristine-tar and a debian (based on upstream) branches.

librun-parts-perl aka Run::Parts (a Perl wrapper around and a pure-perl implementation of Debian s run-parts tool) was initially maintained in the latter way. But especially in cases where we just need a Perl module packaged as .deb without uploading it to CPAN (e.g. project-internal modules), this is a tedious workflow and overkill. It would be much nicer if debhelper would just call dzil to generate all the stuff it needs to build the package. dh-dist-zilla Well, you can do that now, at least with Debian Jessie. This is what dh-dist-zilla does: It is a debhelper sequence plugin which calls dzil build and dzil clean in the right moment and takes care that all dh_auto_* commands look in the directory with the generated files instead of the rather clean project root directory. To use dh-dist-zilla, you just need to add a build-dependency on it and the Dist::Zilla plugins you use, and add

--with
dist-zilla

to your minimal dh-style debian/rules file:

#!/usr/bin/make -f
%:
	dh $@ --with dist-zilla

That s it. With regards to workflow and git branches, you may still want to use separate branches for upstream work and debian work, and you may want to continue to use pristine-tar, but you don t have to commit generated files to git anymore and you can maintain a clean master branch with nearly no redundancy. And if you need to generate to final upstream tar ball for you debian package, just call dh get-orig-source or maybe easier to use with tab completion dh_dist_zilla_origtar. This is how the librun-parts-perl package is maintained nowadays. There s otherwise not much difference to the old, classically maintained versions. More DRY Next step in the DRY evolution is to reduce redundancies between upstream (Dist::Zilla based) packaging and the Debian packaging. There are a few tools available, partially brand new, partially not yet packaged:

dh-dist-zilla s dh-dzil-refresh which combines dh-make-perl s refresh subcommand with Dist::Zilla.
Enrico Zini s debdry, which aims to be a front-end to all the language specific packaging automation tools like dh-make-perl and gem2deb.
The not yet packaged Perl module distribution Dist-Zilla-Deb which beyond others contains the (slightly under-documented) Perl module Dist::Zilla::Plugin::Deb::VersionFromChangelog to use the version from Debian s changelog as the primary source for the version of the module. (Source code is on GitHub.)
And then there is Dist::Zilla::App::Command::authordebs aka libdist-zilla-app-command-authordebs-perl by Dominque Dumont which lists or installs Dist::Zilla authors dependencies as Debian packages. (Source code is on GitHub, too.)

I wouldn t be surprised if there s more to come in this area. P.S.: I actually started this blog posting in September 2014 and never finished it until now. Had to kick out some already outdated again stuff, but also could add some more recent things.

23 March 2017

Mike Hommey: Why is the git-cinnabar master branch slower to clone?

Apart from the memory considerations, one thing that the data presented in the When the memory allocator works against you post that I haven t touched in the followup posts is that there is a large difference in the time it takes to clone mozilla-central with git-cinnabar 0.4.0 vs. the master branch. One thing that was mentioned in the first followup is that reducing the amount of realloc and substring copies made the cloning more than 15 minutes faster on master. But the same code exists in 0.4.0, so this isn t part of the difference. So what s going on? Looking at the CPU usage during the clone is enlightening. On 0.4.0:

On master:

(Note: the data gathering is flawed in some ways, which explains why the git-remote-hg process goes above 100%, which is not possible for this python process. The data is however good enough for the high level analysis that follows, so I didn t bother to get something more acurate) On 0.4.0, the git-cinnabar-helper process was saturating one CPU core during the File import phase, and the git-remote-hg process was saturating one CPU core during the Manifest import phase. Overall, the sum of both processes usually used more than one and a half core. On master, however, the total of both processes barely uses more than one CPU core. What happened? This and that happened. Essentially, before those changes, git-remote-hg would send instructions to git-fast-import (technically, git-cinnabar-helper, but in this case it s only used as a wrapper for git-fast-import), and use marks to track the git objects that git-fast-import created. After those changes, git-remote-hg asks git-fast-import the git object SHA1 of objects it just asked to be created. In other words, those changes replaced something asynchronous with something synchronous: while it used to be possible for git-remote-hg to work on the next file/manifest/changeset while git-fast-import was working on the previous one, it now waits. The changes helped simplify the python code, but made the overall clone process much slower. If I m not mistaken, the only real use for that information is for the mapping of mercurial to git SHA1s, which is actually rarely used during the clone, except at the end, when storing it. So what I m planning to do is to move that mapping to the git-cinnabar-helper process, which, incidentally, will kill not 2, but 3 birds with 1 stone:

It will restore the asynchronicity, obviously (at least, that s the expected main outcome).
Storing the mapping in the git-cinnabar-helper process is very likely to take less memory than what it currently takes in the git-remote-hg process. Even if it doesn t (which I doubt), that should still help stay under the 2GB limit of 32-bit processes.
The whole thing that spikes memory usage during the finalization phase, as seen in previous post, will just go away, because the git-cinnabar-helper process will just have prepared the git notes-like tree on its own.

So expect git-cinnabar 0.5 to get moar faster, and to use moar less memory.

Mike Hommey: Analyzing git-cinnabar memory use

In previous post, I was looking at the allocations git-cinnabar makes. While I had the data, I figured I d also look how the memory use correlates with expectations based on repository data, to put things in perspective. As a reminder, this is what the allocations look like (horizontal axis being the number of allocator function calls):

There are 7 different phases happening during a git clone using git-cinnabar, most of which can easily be identified on the graph above:

Negotiation. During this phase, git-cinnabar talks to the mercurial server to determine what needs to be pulled. Once that is done, a getbundle request is emitted, which response is read in the next three phases. This phase is essentially invisible on the graph.
Reading changeset data. The first thing that a mercurial server sends in the response for a getbundle request is changesets. They are sent in the RevChunk format. Translated to git, they become commit objects. But to create commit objects, we need the entire corresponding trees and files (blobs), which we don t have yet. So we keep this data in memory. In the git clone analyzed here, there are 345643 changesets loaded in memory. Their raw size in RawChunk format is 237MB. I think by the end of this phase, we made 20 million allocator calls, have about 300MB of live data in about 840k allocations. (No certainty because I don t actually have definite data that would allow to correlate between the phases and allocator calls, and the memory usage change between this phase and next is not as clear-cut as with other phases). This puts us at less than 3 live allocations per changeset, with only about 60MB overhead over the raw data.
Reading manifest data. In the stream we receive, manifests follow changesets. Each changeset points to one manifest ; several changesets can point to the same manifest. Manifests describe the content of the entire source code tree in a similar manner as git trees, except they are flat (there s one manifest for the entire tree, where git trees would reference other git trees for sub directories). And like git trees, they only map file paths to file SHA1s. The way they are currently stored by git-cinnabar (which is planned to change) requires knowing the corresponding git SHA1s for those files, and we haven t got those yet, so again, we keep everything in memory. In the git clone analyzed here, there are 345398 manifests loaded in memory. Their raw size in RawChunk format is 1.18GB. By the end of this phase, we made 23 million more allocator calls, and have about 1.52GB of live data in about 1.86M allocations. We re still at less than 3 live allocations for each object (changeset or manifest) we re keeping in memory, and barely over 100MB of overhead over the raw data, which, on average puts the overhead at 150 bytes per object. The three phases so far are relatively fast and account for a small part of the overall process, so they don t appear clear-cut to each other, and don t take much space on the graph.
Reading and Importing files. After the manifests, we finally get files data, grouped by path, such that we get all the file revisions of e.g. .cargo/.gitignore, followed by all the file revisions of .cargo/config.in, .clang-format, and so on. The data here doesn t depend on anything else, so we can finally directly import the data. This means that for each revision, we actually expand the RawChunk into the full file data (RawChunks contain patches against a previous revision), and don t keep the RawChunk around. We also don t keep the full data after it was sent to the git-cinnabar-helper process (as far as cloning is concerned, it s essentially a wrapper for git-fast-import), except for the previous revision of the file, which is likely the patch base for the next revision. We however keep in memory one or two things for each file revision: a mapping of its mercurial SHA1 and the corresponding git SHA1 of the imported data, and, when there is one, the file metadata (containing information about file copy/renames) that lives as a header in the file data in mercurial, but can t be stored in the corresponding git blobs, otherwise we d have irrelevant data in checkouts. On the graph, this is where there is a steady and rather long increase of both live allocations and memory usage, in stairs for the latter. In the git clone analyzed here, there are 2.02M file revisions, 78k of which have copy/move metadata for a cumulated size of 8.5MB of metadata. The raw size of the file revisions in RawChunk format is 3.85GB. The expanded data size is 67GB. By the end of this phase, we made 622 million more allocator calls, and peaked at about 2.05GB of live data in about 6.9M allocations. Compared to the beginning of this phase, that added about 530MB in 5 million allocations. File metadata is stored in memory as python dicts, with 2 entries each, instead of raw form for convenience and future-proofing, so that would be at least 3 allocations each: one for each value, one for the dict, and maybe one for the dict storage ; their keys are all the same and are probably interned by python, so wouldn t count. As mentioned above, we store a mapping of mercurial to git SHA1s, so for each file that makes 2 allocations, 4.04M total. Plus the 230k or 310k from metadata. Let s say 4.45M total. We re short 550k allocations, but considering the numbers involved, it would take less than one allocation per file on average to go over this count. As for memory size, per this answer on stackoverflow, python strings have an overhead of 37 bytes, so each SHA1 (kept in hex form) will take 77 bytes (Note, that s partly why I didn t particularly care about storing them as binary form, that would only save 25%, not 50%). That s 311MB just for the SHA1s, to which the size of the mapping dict needs to be added. If it were a plain array of pointers to keys and values, it would take 2 * 8 bytes per file, or about 32MB. But that would be a hash table with no room for more items (By the way, I suspect the stairs that can be seen on the requested and in-use bytes is the hash table being realloc()ed). Plus at least 290 bytes per dict for each of the 78k metadata, which is an additional 22M. All in all, 530MB doesn t seem too much of a stretch.
Importing manifests. At this point, we re done receiving data from the server, so we begin by dropping objects related to the bundle we got from the server. On the graph, I assume this is the big dip that can be observed after the initial increase in memory use, bringing us down to 5.6 million allocations and 1.92GB. Now begins the most time consuming process, as far as mozilla-central is concerned: transforming the manifests into git trees, while also storing enough data to be able to reconstruct manifests later (which is required to be able to pull from the mercurial server after the clone). So for each manifest, we expand the RawChunk into the full manifest data, and generate new git trees from that. The latter is mostly performed by the git-cinnabar-helper process. Once we re done pushing data about a manifest to that process, we drop the corresponding data, except when we know it will be required later as the delta base for a subsequent RevChunk (which can happen in bundle2). As with file revisions, for each manifest, we keep track of the mapping of SHA1s between mercurial and git. We also keep a DAG of the manifests history (contrary to git trees, mercurial manifests track their ancestry ; files do too, but git-cinnabar doesn t actually keep track of that separately ; it just relies on the manifests data to infer file ancestry). On the graph, this is where the number of live allocations increases while both requested and in-use bytes decrease, noisily. By the end of this phase, we made about 1 billion more allocator calls. Requested allocations went down to 1.02GB, for close to 7 million live allocations. Compared to the end of the dip at the beginning of this phase, that added 1.4 million allocations, and released 900MB. By now, we expect everything from the Reading manifests phase to have been released, which means we allocated around 620MB (1.52GB 900MB), for a total of 3.26M additional allocations (1.4M + 1.86M). We have a dict for the SHA1s mapping (345k * 77 * 2 for strings, plus the hash table with 345k items, so at least 60MB), and the DAG, which, now that I m looking at memory usage, I figure has the one of the possibly worst structure, using 2 sets for each node (at least 232 bytes per set, that s at least 160MB, plus 2 hash tables with 345k items). I think 250MB for those data structures would be largely underestimated. It s not hard to imagine them taking 620MB, because really, that DAG implementation is awful. The number of allocations expected from them would be around 1.4M (4 * 345k), but I might be missing something. That s way less than the actual number, so it would be interesting to take a closer look, but not before doing something about the DAG itself. Fun fact: the amount of data we re dealing with in this phase (the expanded size of all the manifests) is close to 2.9TB (yes, terabytes). With about 4700 seconds spent on this phase on a real clone (less with the release branch), we re still handling more than 615MB per second.
Importing changesets. This is where we finally create the git commits corresponding to the mercurial changesets. For each changeset, we expand its RawChunk, find the git tree we created in the previous phase that corresponds to the associated manifest, and create a git commit for that tree, with the right date, author, and commit message. For data that appears in the mercurial changeset that can t be stored or doesn t make sense to store in the git commit (e.g. the manifest SHA1, the list of changed files[*], or some extra metadata like the source of rebases), we keep some metadata we ll store in git notes later on. [*] Fun fact: the list of changed files stored in mercurial changesets does not necessarily match the list of files in a git diff between the corresponding git commit and its parents, for essentially two reasons:
- Old buggy versions of mercurial have generated erroneous lists that are now there forever (they are part of what makes the changeset SHA1).
- Mercurial may create new revisions for files even when the file content is not modified, most notably during merges (but that also happened on non-merges due to, presumably, bugs).
so we keep it verbatim. On the graph, this is where both requested and in-use bytes are only slightly increasing. By the end of this phase, we made about half a billion more allocator calls. Requested allocations went up to 1.06GB, for close to 7.7 million live allocations. Compared to the end of the previous phase, that added 700k allocations, and 400MB. By now, we expect everything from the Reading changesets phase to have been released (at least the raw data we kept there), which means we may have allocated at most around 700MB (400MB + 300MB), for a total of 1.5M additional allocations (700k + 840k). All these are extra data we keep for the next and final phase. It s hard to evaluate the exact size we d expect here in memory, but if we divide by the number of changesets (345k), that s less than 5 allocations per changeset and less than 2KB per changeset, which is low enough not to raise eyebrows, at least for now.
Finalizing the clone. The final phase is where we actually go ahead storing the mappings between mercurial and git SHA1s (all 2.7M of them), the git notes where we store the data necessary to recreate mercurial changesets from git commits, and a cache for mercurial tags. On the graph, this is where the requested and in-use bytes, as well as the number of live allocations peak like crazy (up to 21M allocations for 2.27GB requested). This is very much unwanted, but easily explained with the current state of the code. The way the mappings between mercurial and git SHA1s are stored is via a tree similar to how git notes are stored. So for each mercurial SHA1, we have a file that points to the corresponding git SHA1 through git links for commits or directly for blobs (look at the output of git ls-tree -r refs/cinnabar/metadata^3 if you re curious about the details). If I remember correctly, it s faster if the tree is created with an ordered list of paths, so the code created a list of paths, and then sorted it to send commands to create the tree. The former creates a new str of length 42 and a tuple of 3 elements for each and every one of the 2.7M mappings. With the 37 bytes overhead by str instance and the 56 + 3 * 8 bytes per tuple, we have at least 429MB wasted. Creating the tree itself keeps the corresponding fast-import commands in a buffer, where each command is going to be a tuple of 2 elements: a pointer to a method, and a str of length between 90 and 93. That s at least another 440MB wasted. I already fixed the first half, but the second half still needs addressing.

Overall, except for the stupid spike during the final phase, the manifest DAG and the glibc allocator runaway memory use described in previous posts, there is nothing terribly bad with the git-cinnabar memory usage, all things considered. Mozilla-central is just big. The spike is already half addressed, and work is under way for the glibc allocator runaway memory use. The manifest DAG, interestingly, is actually mostly useless. It s only used to track the heads of the DAG, and it s very much possible to track heads of a DAG without actually storing the entire DAG. In fact, that s what git-cinnabar already does for changeset heads so we would only need to do the same for manifest heads. One could argue that the 1.4GB of raw RevChunk data we re keeping in memory for later user could be kept on disk instead. I haven t done this so far because I didn t want to have to handle temporary files (and answer questions like where to put them? , what if there isn t enough disk space there? , what if disk access is slow? , etc.). But the majority of this data is from manifests. I m already planning changes in how git-cinnabar stores manifests data that will actually allow to import them directly, instead of keeping them in memory until files are imported. This would instantly remove 1.18GB of memory usage. The downside, however, is that this would be more CPU intensive: Importing changesets will require creating the corresponding git trees, and getting the stored manifest data. I think it s worth, though. Finally, one thing that isn t obvious here, but that was found while analyzing why RSS would be going up despite memory usage going down, is that git-cinnabar is doing way too many reallocations and substring allocations. So let s look at two metrics that hopefully will highlight the problem:

The cumulated amount of requested memory. That is, the sum of all sizes ever given to malloc, realloc, calloc, etc.
The compensated cumulated amount of requested memory (naming is hard). That is, the sum of all sizes ever given to malloc, calloc, etc. except realloc. For realloc, we only count the delta in size between what the size was before and after the realloc.

Assuming all the requested memory is filled at some point, the former gives us an upper bound to the amount of memory that is ever filled or copied (the amount that would be filled if no realloc was ever in-place), while the the latter gives us a lower bound (the amount that would be filled or copied if all reallocs were in-place). Ideally, we d want the upper and lower bounds to be close to each other (indicating few realloc calls), and the total amount at the end of the process to be as close as possible to the amount of data we re handling (which we ve seen is around 3TB).

and this is clearly bad. Like, really bad. But we already knew that from the previous post, although it s nice to put numbers on it. The lower bound is about twice the amount of data we re handling, and the upper bound is more than 10 times that amount. Clearly, we can do better. We ll see how things evolve after the necessary code changes happen. Stay tuned.

14 March 2017

John Goerzen: Parsing the GOP s Health Insurance Statistics

There has been a lot of noise lately about the GOP health care plan (AHCA) and the differences to the current plan (ACA or Obamacare). A lot of statistics are being misinterpreted. The New York Times has an excellent analysis of some of this. But to pick it apart, I want to highlight a few things: Many Republicans are touting the CBO s estimate that, some years out, premiums will be 10% lower under their plan than under the ACA. However, this carries with it a lot of misleading information. First of all, many are spinning this as if costs would go down. That s not the case. The premiums would still rise they would just have risen less by the end of the period than under ACA. That also ignores the immediate spike and throwing millions out of the insurance marketplace altogether. Now then, where does this 10% number come from? First of all, you have to understand the older people are substantially more expensive to the health system, and therefore more expensive to insure. ACA limited the price differential from the youngest to the oldest people, which meant that in effect some young people were subsidizing older ones on the individual market. The GOP plan removes that limit. Combined with other changes in subsidies and tax credits, this dramatically increases the cost to older people. For instance, the New York Times article cites a CBO estimate that the price an average 64-year-old earning $26,500 would need to pay after using a subsidy would increase from $1,700 under Obamacare to $14,600 under the Republican plan. They further conclude that these exceptionally high rates would be so unaffordable to older people that the older people will simply stop buying insurance on the individual market. This means that the overall risk pool of people in that market is healthier, and therefore the average price is lower. So, to sum up: the reason that insurance premiums under the GOP plan will rise at a slightly slower rate long-term is that the higher-risk people will be unable to afford insurance in the first place, leaving only the cheaper people to buy in.

12 March 2017

Mike Hommey: When the memory allocator works against you

Cloning mozilla-central with git-cinnabar requires a lot of memory. Actually too much memory to fit in a 32-bits address space. I hadn t optimized for memory use in the first place. For instance, git-cinnabar keeps sha-1s in memory as hex values (40 bytes) rather than raw values (20 bytes). When I wrote the initial prototype, it didn t matter that much, and while close(ish) to the tipping point, it didn t require more than 2GB of memory at the time. Time passed, and mozilla-central grew. I suspect the recent addition of several thousands of commits and files has made things worse. In order to come up with a plan to make things better (short or longer term), I needed data. So I added some basic memory resource tracking, and collected data while cloning mozilla-central. I must admit, I was not ready for what I witnessed. Follow me for a tale of frustrations (plural). I was expecting things to have gotten worse on the master branch (which I used for the data collection) because I am in the middle of some refactoring and did many changes that I was suspecting might have affected memory usage. I wasn t, however, expecting to see the clone command using 10GB(!) memory at peak usage across all processes.

(Note, those memory sizes are RSS, minus shared ) It also was taking an unexpected long time, but then, I hadn t cloned a large repository like mozilla-central from scratch in a while, so I wasn t sure if it was just related to its recent growth in size or otherwise. So I collected data on 0.4.0 as well.

Less time spent, less memory usage ok. There s definitely something wrong on master. But wait a minute, that slope from ~2GB to ~4GB on the git-remote-hg process doesn t actually make any kind of sense. I mean, I d understand it if it were starting and finishing with the Import manifest phase, but it starts in the middle of it, and ends long before it finishes. WTH? First things first, since RSS can be a variety of things, I checked /proc/$pid/smaps and confirmed that most of it was, indeed, the heap. That s the point where you reach for Google, type something like python memory profile and find various tools. One from the results that I remembered having used in the past is guppy s heapy. Armed with pdb, I broke execution in the middle of the slope, and tried to get memory stats with heapy. SIGSEGV. Ouch. Let s try something else. I reached out to objgraph and pympler. SIGSEGV. Ouch again. Tried working around the crashes for a while (too long while, retrospectively, hindsight is 20/20), and was somehow successful at avoiding them by peaking at a smaller set of objects. But whatever I did, despite being attached to a process that had 2.6GB RSS, I wasn t able to find more than 1.3GB of data. This wasn t adding up. It surely didn t help that getting to that point took close to an hour each time. Retrospectively, I wish I had investigated using something like Checkpoint/Restore in Userspace. Anyways, after a while, I decided that I really wanted to try to see the whole picture, not smaller peaks here and there that might be missing something. So I resolved myself to look at the SIGSEGV I was getting when using pympler, collecting a core dump when it happened. Guess what? The Debian python-dbg package does not contain the debug symbols for the python package. The core dump was useless. Since I was expecting I d have to fix something in python, I just downloaded its source and built it. Ran the command again, waited, and finally got a backtrace. First Google hit for the crashing function? The exact (unfixed) crash reported on the python bug tracker. No patch. Crashing code is doing:

((f)->f_builtins != (f)->f_tstate->interp->builtins)

And (f)->f_tstate is NULL. Classic NULL deref. Added a guard (assessing it wouldn t break anything). Ran the command again. Waited. Again. SIGSEGV. Facedesk. Another crash on the same line. Did I really use the patched python? Yes. But this time (f)->f_tstate->interp is NULL. Sigh. Same player, shoot again. Finally, no crash but still stuck on only 1.3GB accounted for. Ok, I know not all python memory profiling tools are entirely reliable, let s try heapy again. SIGSEGV. Sigh. No debug info on the heapy module, where the crash happens. Sigh. Rebuild the module with debug info, try again. The backtrace looks like heapy is recursing a lot. Look at %rsp, compare with the address space from /proc/$pid/maps. Confirmed. A stack overflow. Let s do ugly things and increase the stack size in brutal ways. Woohoo! Now heapy tells me there s even less memory used than the 1.3GB I found so far. Like, half less. Yeah, right. I m not clear on how I got there, but that s when I found gdb-heap, a tool from Red Hat s David Malcolm, and the associated talk Dude, where s my RAM? A deep dive into how Python uses memory (slides). With a gdb attached, I would finally be able to rip python s guts out and find where all the memory went. Or so I thought. The gdb-heap tool only found about 600MB. About as much as heapy did, for that matter, but it could be coincidental. Oh. Kay. I don t remember exactly what went through my mind then, but, since I was attached to a running process with gdb, I typed the following on the gdb prompt:

gdb> call malloc_stats()

And that s when the truth was finally unvealed: the memory allocator was just acting up the whole time. The ouput was something like:

Arena 0:
system bytes    =  some number above (but close to) 2GB
in use bytes    =  some number above (but close to) 600MB

Yes, the glibc allocator was just telling it had allocated 600MB of memory, but was holding onto 2GB. I must have found a really bad allocation pattern that causes massive fragmentation. One thing that David Malcolm s talk taught me, though, is that python uses its own allocator for small sizes, so the glibc allocator doesn t know about them. And, roughly, adding the difference between RSS and what glibc said it was holding to to the use bytes it reported somehow matches the 1.3GB I had found so far. So it was time to see how those things evolved in time, during the entire clone process. I grabbed some new data, tracking the evolution of system bytes and in use bytes .

There are two things of note on this data:

There is a relatively large gap between what the glibc allocator says it has gotten from the system, and the RSS (minus shared ) size, that I m expecting corresponds to the small allocations that python handles itself.
Actual memory use is going down during the Import manifests phase, contrary to what the evolution of RSS suggests.

In fact, the latter is exactly how git-cinnabar is supposed to work: It reads changesets and manifests chunks, and holds onto them while importing files. Then it throws away those manifests and changesets chunks one by one while it imports them. There is, however, some extra bookkeeping that requires some additional memory, but it s expected to be less memory consuming than keeping all the changesets and manifests chunks in memory. At this point, I thought a possible explanation is that since both python and glibc are mmap()ing their own arenas, they might be intertwined in a way that makes things not go well with the allocation pattern happening during the Import manifest phase (which, in fact, allocates and frees increasingly large buffers for each manifest, as manifests grow in size in the mozilla-central history). To put the theory at work, I patched the python interpreter again, making it use malloc() instead of mmap() for its arenas.

Aha! I thought. That definitely looks much better. Less gap between what glibc says it requested from the system and the RSS size. And, more importantly, no runaway increase of memory usage in the middle of nowhere. I was preparing myself to write a post about how mixing allocators could have unintended consequences. As a comparison point, I went ahead and ran another test, with the python allocator entirely disabled, this time.

Heh. It turns out glibc was acting up all alone. So much for my (plausible) theory. (I still think mixing allocators can have unintended consequences.) (Note, however, that the reason why the python allocator exists is valid: without it, the overall clone took almost 10 more minutes) And since I had been getting all this data with 0.4.0, I gathered new data without the python allocator with the master branch.

This paints a rather different picture than the original data on that branch, with much less memory use regression than one would think. In fact, there isn t much difference, except for the spike at the end, which got worse, and some of the noise during the Import manifests phase that got bigger, implying larger amounts of temporary memory used. The latter may contribute to the allocation patterns that throw glibc s memory allocator off. It turns out tracking memory usage in python 2.7 is rather painful, and not all the tools paint a complete picture of it. I hear python 3.x is somewhat better in that regard, and I hope it s true, but at the moment, I m stuck with 2.7. The most reliable tool I ve used here, it turns out, is pympler. Or rebuilding the python interpreter without its allocator, and asking the system allocator what is allocated. With all this data, I now have some defined problems to tackle, some easy (the spike at the end of the clone), and some less easy (working around glibc allocator s behavior). I have a few hunches as to what kind of allocations are causing the runaway increase of RSS. Coincidentally, I m half-way through a refactor of the code dealing with manifests, and it should help dealing with the issue. But that will be the subject of a subsequent post.

Next.

Previous.

Search Results: "pik"

27 July 2021

Changes in version 0.0.1 (2021-07-25) Initial version and CRAN upload

22 July 2021

18 July 2021

13 May 2021

26 April 2021

11 April 2021

Major monasteries Rumtek Monastery, 20Km from Gangtok Lingdum/Ranka Monastery, 17Km from Gangtok Phodong Monastery, 28Km from Gangtok Ralang Monastery, 10Km from Ravangla Tsuklakhang Monastery, Royal Palace, Gangtok Enchey Monastery, Gangtok Tashiding Monastery, 35Km from Ravangla

East Sikkim

West Sikkim Time needed: 3N/1N Hostels at Pelling : Mochilerro Ostillo

Itinerary

Day 1: Gangtok - Ravangla - Pelling Leave Gangtok early, for Ravangla through the Temi Tea Estate route. Spend some time at the tea garden and then visit Buddha Park at Ravangla Head to Pelling from Ravangla

Day 2: Pelling sightseeing Hire a cab and visit Skywalk, Pemayangtse Monastery, Rabdentse Ruins, Kecheopalri Lake, Kanchenjunga Falls.

Day 3: Pelling - Gangtok/Siliguri Wake up early to catch a glimpse of Kanchenjunga at the Pelling Helipad around sunrise Head back to Gangtok on a shared-cab You could take a bus/taxi back to Siliguri if Pelling is your last stop.

Itinerary

Shared-cab hacks Intercity rides can be exhausting. If you can afford it, pay for an additional seat. Call shotgun on the drives beyond Lachen and Lachung. The views are breathtaking. Return cabs tend to be cheaper (WB cabs travelling from SK and vice-versa)

Buddhist souvenirs : Colourful Prayer Flags (great for tying on bikes or behind car windshields) Miniature Prayer/Mani Wheels Lucky Charms, Pendants and Key Chains Cham Dance masks and robes Singing Bowls Common symbols: Om mani padme hum, Ashtamangala, Zodiac signs

Handicrafts & Handlooms Tibetan Yak Wool shawls, scarfs and carpets Sikkimese Ceramic cups Thangka Paintings

Edibles Darjeeling Tea (usually brewed and not boiled) Wine (Arucha Peach & Rhododendron) Dalle Khursani (Chilli) Paste and Pickle

6 March 2021

curl example curl \ -H "x-dcs-apikey: $(cat dcs-apikey-stapelberg.txt)" \ -X GET \ "https://codesearch.debian.net/api/v1/search?query=i3Font&match_mode=regexp"

Web browser example You can try out the API in your web browser in the OpenAPI documentation.

Feedback? File a GitHub issue on github.com/Debian/dcs please!

23 September 2020

18 September 2020

17 May 2020

17 March 2020

6 March 2020

Distribution work

Software development

16 November 2017

23 August 2017

1 April 2017

28 March 2017

23 March 2017

14 March 2017

12 March 2017

Changes in version 0.0.1 (2021-07-25)

Initial version and CRAN upload

Major monasteries

Rumtek Monastery, 20Km from Gangtok

Lingdum/Ranka Monastery, 17Km from Gangtok

Phodong Monastery, 28Km from Gangtok

Ralang Monastery, 10Km from Ravangla

Tsuklakhang Monastery, Royal Palace, Gangtok

Enchey Monastery, Gangtok

Tashiding Monastery, 35Km from Ravangla

West Sikkim

Time needed: 3N/1N

Hostels at Pelling : Mochilerro Ostillo

Day 1: Gangtok - Ravangla - Pelling

Leave Gangtok early, for Ravangla through the Temi Tea Estate route.

Spend some time at the tea garden and then visit Buddha Park at Ravangla

Head to Pelling from Ravangla

Day 2: Pelling sightseeing

Hire a cab and visit Skywalk, Pemayangtse Monastery, Rabdentse Ruins, Kecheopalri Lake, Kanchenjunga Falls.

Day 3: Pelling - Gangtok/Siliguri

Wake up early to catch a glimpse of Kanchenjunga at the Pelling Helipad around sunrise

Head back to Gangtok on a shared-cab

You could take a bus/taxi back to Siliguri if Pelling is your last stop.

Shared-cab hacks

Intercity rides can be exhausting. If you can afford it, pay for an additional seat.

Call shotgun on the drives beyond Lachen and Lachung. The views are breathtaking.

Return cabs tend to be cheaper (WB cabs travelling from SK and vice-versa)

Buddhist souvenirs :

Colourful Prayer Flags (great for tying on bikes or behind car windshields)

Miniature Prayer/Mani Wheels

Lucky Charms, Pendants and Key Chains

Cham Dance masks and robes

Singing Bowls

Common symbols: Om mani padme hum, Ashtamangala, Zodiac signs

Handicrafts & Handlooms

Tibetan Yak Wool shawls, scarfs and carpets

Sikkimese Ceramic cups

Thangka Paintings

Edibles

Darjeeling Tea (usually brewed and not boiled)

Wine (Arucha Peach & Rhododendron)

Dalle Khursani (Chilli) Paste and Pickle

curl example
`curl \ -H "x-dcs-apikey: $(cat dcs-apikey-stapelberg.txt)" \ -X GET \ "https://codesearch.debian.net/api/v1/search?query=i3Font&match_mode=regexp"`

Feedback? File a GitHub issue on `github.com/Debian/dcs` please!